16 Apr, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes for the current series. This contains:

    - Two fixes for NVMe:

    One fixes a reset race that can be triggered by repeated
    insert/removal of the module.

    The other fixes an issue on some platforms, where we get probe
    timeouts since legacy interrupts aren't working. This used not to
    be a problem since we had the worker thread poll for completions,
    but since that was killed off, it means those poor souls can't
    successfully probe their NVMe device. Use a proper IRQ check and
    probe (MSI-X -> MSI -> legacy), like most other drivers, to work
    around this. Both from Keith.

    - A loop corruption issue with offset in iters, from Ming Lei.

    - A fix for not having the partition stat per cpu ref count
    initialized before sending out the KOBJ_ADD, which could cause user
    space to access the counter prior to initialization. Also from
    Ming Lei.

    - A fix for using the wrong congestion state, from Kaixu Xia"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: loop: fix filesystem corruption in case of aio/dio
    NVMe: Always use MSI/MSI-x interrupts
    NVMe: Fix reset/remove race
    writeback: fix the wrong congested state variable definition
    block: partition: initialize percpuref before sending out KOBJ_ADD

    Linus Torvalds
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to
    implement the page cache with bigger chunks than PAGE_SIZE.

    That promise never materialized, and it is unlikely to.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular
    case, especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
    much breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to
    the PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach.
    I'll fix them manually in a separate patch. Comments and
    documentation will also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

30 Mar, 2016

1 commit

    The initialization of a partition's percpu_ref should be done before
    sending out the KOBJ_ADD uevent, which may cause userspace to read
    the partition table; otherwise the uninitialized percpu_ref may be
    accessed in the data path.

    This patch fixes this issue reported by Naveen.

    Reported-by: Naveen Kaje
    Tested-by: Naveen Kaje
    Fixes: 6c71013ecb7e2 (block: partition: convert percpu ref)
    Cc: # v4.3+
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

25 Mar, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "Final round of fixes for this merge window - some of this has come up
    after the initial pull request, and some of it was put in a post-merge
    branch before the merge window.

    This contains:

    - Fix for a bad check for an error on dma mapping in the mtip32xx
    driver, from Alexey Khoroshilov.

    - A set of fixes for lightnvm, from Javier, Matias, and Wenwei.

    - An NVMe completion record corruption fix from Marta, ensuring that
    we read things in the right order.

    - Two writeback fixes from Tejun, marked for stable@ as well.

    - A blk-mq sw queue iterator fix from Thomas, fixing an oops for
    sparse CPU maps. They hit this in the hot plug/unplug rework"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvme: avoid cqe corruption when update at the same time as read
    writeback, cgroup: fix use of the wrong bdi_writeback which mismatches the inode
    writeback, cgroup: fix premature wb_put() in locked_inode_to_wb_and_lock_list()
    blk-mq: Use proper cpumask iterator
    mtip32xx: fix checks for dma mapping errors
    lightnvm: do not load L2P table if not supported
    lightnvm: do not reserve lun on l2p loading
    nvme: lightnvm: return ppa completion status
    lightnvm: add a bitmap of luns
    lightnvm: specify target's logical address area
    null_blk: add lightnvm null_blk device to the nullb_list

    Linus Torvalds
     

20 Mar, 2016

1 commit

  • queue_for_each_ctx() iterates over per_cpu variables under the assumption that
    the possible cpu mask cannot have holes. That's wrong as all cpumasks can have
    holes. In case there are holes the iteration ends up accessing uninitialized
    memory and crashing as a result.

    Replace the macro by a proper for_each_possible_cpu() loop and drop the unused
    macro blk_ctx_sum() which references queue_for_each_ctx().

    Reported-by: Xiong Zhou
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     

19 Mar, 2016

2 commits

  • Pull libata updates from Tejun Heo:

    - ahci grew runtime power management support so that the controller can
    be turned off if no devices are attached.

    - sata_via isn't dead yet. It got hotplug support and more refined
    workaround for certain WD drives.

    - Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
    conflicts in ahci PCI ID table.

    * 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
    ata: ahci_xgene: dereferencing uninitialized pointer in probe
    AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
    ata: sata_rcar: Use ARCH_RENESAS
    sata_via: Implement hotplug for VT6421
    sata_via: Apply WD workaround only when needed on VT6421
    ahci: Add runtime PM support for the host controller
    ahci: Add functions to manage runtime PM of AHCI ports
    ahci: Convert driver to use modern PM hooks
    ahci: Cache host controller version
    scsi: Drop runtime PM usage count after host is added
    scsi: Set request queue runtime PM status back to active on resume
    block: Add blk_set_runtime_active()
    ata: ahci_mvebu: add support for Armada 3700 variant
    libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
    libata: support AHCI on OCTEON platform

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "Here are the core block changes for this merge window. Not a lot of
    exciting stuff going on in this round, most of the changes have been
    on the driver side of things. That pull request is coming next. This
    pull request contains:

    - A set of fixes for chained bio handling from Christoph.

    - A tag bounds check for blk-mq from Hannes, ensuring that we don't
    do something stupid if a device reports an invalid tag value.

    - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

    - A set of blk-mq fixes from Keith, adding support for dynamic
    hardware queues, and fixing init of max_dev_sectors for stacking
    devices.

    - A fix for the dynamic hw context from Ming.

    - Enabling of cgroup writeback support on a block device, from
    Shaohua"

    * 'for-4.6/core' of git://git.kernel.dk/linux-block:
    blk-mq: add bounds check on tag-to-rq conversion
    block: bio_remaining_done() isn't unlikely
    block: cleanup bio_endio
    block: factor out chained bio completion
    block: don't unecessarily clobber bi_error for chained bios
    block-dev: enable writeback cgroup support
    blk-mq: Fix NULL pointer updating nr_requests
    blk-mq: mark request queue as mq asap
    block: Initialize max_dev_sectors to 0
    blk-mq: dynamic h/w context count
    cfq-iosched: Allow parent cgroup to preempt its child
    cfq-iosched: Allow sync noidle workloads to preempt each other
    cfq-iosched: Reorder checks in cfq_should_preempt()
    cfq-iosched: Don't group_idle if cfqq has big thinktime

    Linus Torvalds
     

17 Mar, 2016

1 commit

  • Pull device mapper updates from Mike Snitzer:

    - Most attention this cycle went to optimizing blk-mq request-based DM
    (dm-mq) that is used exclusively by DM multipath:

    - A stable fix for dm-mq that eliminates excessive context
    switching offers the biggest performance improvement (for both
    IOPs and throughput).

    - But more work is needed, during the next cycle, to reduce
    spinlock contention in DM multipath on large NUMA systems.

    - A stable fix for a NULL pointer seen when DM stats is enabled on a DM
    multipath device that must requeue an IO due to path failure.

    - A stable fix for DM snapshot to disallow the COW and origin devices
    from being identical. This amounts to graceful failure in the face
    of userspace error because these devices shouldn't ever be identical.

    - Stable fixes for DM cache and DM thin provisioning to address crashes
    seen if/when their respective metadata device experiences failures
    that cause the transition to 'fail_io' mode.

    - The DM cache 'mq' policy is now an alias for the 'smq' policy. The
    'smq' policy proved to be consistently better than 'mq'. As such
    'mq', with all its complex user-facing tunables, has been eliminated.

    - Improve DM thin provisioning to consistently return -ENOSPC once the
    thin-pool's data volume is out of space.

    - Improve DM core to properly handle error propagation if
    bio_integrity_clone() fails in clone_bio().

    - Other small cleanups and improvements to DM core.

    * tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (41 commits)
    dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
    dm thin: consistently return -ENOSPC if pool has run out of data space
    dm cache: bump the target version
    dm cache: make sure every metadata function checks fail_io
    dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
    dm cache policy smq: clarify that mq registration failure was for 'mq'
    dm: return error if bio_integrity_clone() fails in clone_bio()
    dm thin metadata: don't issue prefetches if a transaction abort has failed
    dm snapshot: disallow the COW and origin devices from being identical
    dm cache: make the 'mq' policy an alias for 'smq'
    dm: drop unnecessary assignment of md->queue
    dm: reorder 'struct mapped_device' members to fix alignment and holes
    dm: remove dummy definition of 'struct dm_table'
    dm: add 'dm_numa_node' module parameter
    dm thin metadata: remove needless newline from subtree_dec() DMERR message
    dm mpath: cleanup reinstate_path() et al based on code review
    dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
    dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
    dm round robin: use percpu 'repeat_count' and 'current_path'
    dm path selector: remove 'repeat_count' return from .select_path hook
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits

    This patch has been carried in the Android tree for quite some time
    and is one of the few patches required to get a mainline kernel up
    and running with an existing Android userspace. So I wanted to
    submit it for review and consideration of whether it should be
    merged.

    For partitions, add two new uevent parameters: 'PARTN', which
    specifies the partition's index in the table, and 'PARTNAME', which
    specifies the partition name of a partition device.

    Android's userspace uses this for creating device node links from the
    partition name and number, ie:

    /dev/block/platform/soc/by-name/system
    or
    /dev/block/platform/soc/by-num/p1

    One can see its usage here:
    https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
    and
    https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494

    [john.stultz@linaro.org: dropped NPARTS and reworded commit message for context]
    Signed-off-by: Dima Zavin
    Signed-off-by: John Stultz
    Cc: Jens Axboe
    Cc: Rom Lemarchand
    Cc: Android Kernel Team
    Cc: Jeff Moyer
    Cc:
    Cc: Kees Cook
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    San Mehat
     
  • We need to check for a valid index before accessing the array
    element to avoid accessing invalid memory regions.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jeff Moyer

    Modified by Jens to drop the unlikely(), and to make the
    fall-through path the one with a valid tag.

    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

14 Mar, 2016

4 commits


04 Mar, 2016

3 commits

  • A h/w context's tags are freed if it was not assigned a CPU. Check if
    the context has tags before updating the depth.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
    This patch adds support for larger requests in blk_rq_map_user_iov
    by allowing it to build multiple bios for a request. This
    functionality used to exist for the non-vectored blk_rq_map_user in
    the past, and this patch reuses the existing functionality for it on
    the unmap side, which stuck around. Thanks to the iov_iter API,
    supporting multiple bios is fairly trivial: we can just iterate the
    iov until we've consumed the whole iov_iter.

    Signed-off-by: Christoph Hellwig
    Reported-by: Jeff Lien
    Tested-by: Jeff Lien
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This patch applies the two introduced helpers to
    figure out the 1st and last bvec.

    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Feb, 2016

1 commit

    The recent *sync enabling discovered that we are inserting pages
    into the block_device pagecache counter to the expectations of the
    dirty data tracking for dax mappings. This can lead to data
    corruption.

    We want to support DAX for block devices eventually, but it requires
    wider changes to properly manage the pagecache.

    dump_stack+0x85/0xc2
    dax_writeback_mapping_range+0x60/0xe0
    blkdev_writepages+0x3f/0x50
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait+0x4a/0xa0
    set_blocksize+0x70/0xd0
    sb_set_blocksize+0x1d/0x50
    ext4_fill_super+0x75b/0x3360
    mount_bdev+0x180/0x1b0
    ext4_mount+0x15/0x20
    mount_fs+0x38/0x170

    Mark the support broken so it's disabled by default, but otherwise
    still available for testing.

    Signed-off-by: Dan Williams
    Signed-off-by: Ross Zwisler
    Reported-by: Ross Zwisler
    Suggested-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

23 Feb, 2016

1 commit

  • Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
    than if an underlying null_blk device were used directly. One of the
    reasons for this drop in performance is that blk_insert_clone_request()
    was calling blk_mq_insert_request() with @async=true. This forced the
    use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
    which ushered in ping-ponging between process context (fio in this case)
    and kblockd's kworker to submit the cloned request. The ftrace
    function_graph tracer showed:

    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...
    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...

    Fixing blk_insert_clone_request()'s blk_mq_insert_request() call to
    _not_ use kblockd to submit the cloned requests isn't enough to
    eliminate the observed context switches.

    In addition to this dm-mq specific blk-core fix, there are 2 DM core
    fixes to dm-mq that (when paired with the blk-core fix) completely
    eliminate the observed context switching:

    1) don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by the desire to reduce the overhead of dm-mq; punting to
    kblockd just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this). So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

    2) use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request). Using blk_mq_complete_request() for
    blk-mq requests is important for performance. It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.

    dm-mq fix #2 is _much_ more important than #1 for eliminating the
    context switches.
    Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
    After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

    With these changes multithreaded async read IOPs improved from ~950K
    to ~1350K for this dm-mq stacked on null_blk test-case. The raw read
    IOPs of the underlying null_blk device for the same workload is ~1950K.

    Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
    Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
    Cc: stable@vger.kernel.org # 4.1+
    Reported-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mike Snitzer
     

19 Feb, 2016

1 commit

    If a block device is left runtime-suspended during system suspend,
    the driver's resume hook typically corrects the runtime PM status of
    the device back to "active" after it is resumed. However, this is
    not enough, as the queue's runtime PM status is still "suspended".
    As long as it is in this state, blk_pm_peek_request() returns NULL
    and thus prevents new requests from being processed.

    Add new function blk_set_runtime_active() that can be used to force the
    queue status back to "active" as needed.

    Signed-off-by: Mika Westerberg
    Acked-by: Jens Axboe
    Signed-off-by: Tejun Heo

    Mika Westerberg
     

18 Feb, 2016

2 commits

  • Pull block fixes from Jens Axboe:
    "A collection of fixes from the past few weeks that should go into 4.5.
    This contains:

    - Overflow fix for sysfs discard show function from Alan.

    - A stacking limit init fix for max_dev_sectors, so we don't end up
    artificially capping some use cases. From Keith.

    - Have blk-mq properly end unstarted requests on a dying queue,
    instead of pushing that to the driver. From Keith.

    - NVMe:
    - Update to Kconfig description for NVME_SCSI, since it was
    vague and having it on is important for some SUSE distros.
    From Christoph.
    - Set of fixes from Keith, around surprise removal. Also kills
    the no-merge flag, so it supports merging.

    - Set of fixes for lightnvm from Matias, Javier, and Wenwei.

    - Fix null_blk oops when asked for lightnvm, but not available. From
    Matias.

    - Copy-to-user EINTR fix from Hannes, fixing a case where SG_IO fails
    if interrupted by a signal.

    - Two floppy fixes from Jiri, fixing signal handling and blocking
    open.

    - A use-after-free fix for O_DIRECT, from Mike Krinkin.

    - A block module ref count fix from Roman Pen.

    - An fs IO wait accounting fix for O_DSYNC from Stephane Gasparini.

    - Smaller realloc fix for xen-blkfront from Bob Liu.

    - Removal of an unused struct member in the deadline IO scheduler,
    from Tahsin.

    - Also from Tahsin, properly initialize inode struct members
    associated with cgroup writeback, if enabled.

    - From Tejun, ensure that we keep the superblock pinned during cgroup
    writeback"

    * 'for-linus' of git://git.kernel.dk/linux-block: (25 commits)
    blk: fix overflow in queue_discard_max_hw_show
    writeback: initialize inode members that track writeback history
    writeback: keep superblock pinned during cgroup writeback association switches
    bio: return EINTR if copying to user space got interrupted
    NVMe: Rate limit nvme IO warnings
    NVMe: Poll device while still active during remove
    NVMe: Requeue requests on suspended queues
    NVMe: Allow request merges
    NVMe: Fix io incapable return values
    blk-mq: End unstarted requests on dying queue
    block: Initialize max_dev_sectors to 0
    null_blk: oops when initializing without lightnvm
    block: fix module reference leak on put_disk() call for cgroups throttle
    nvme: fix Kconfig description for BLK_DEV_NVME_SCSI
    kernel/fs: fix I/O wait not accounted for RW O_DSYNC
    floppy: refactor open() flags handling
    lightnvm: allow to force mm initialization
    lightnvm: check overflow and correct mlc pairs
    lightnvm: fix request intersection locking in rrpc
    lightnvm: warn if irqs are disabled in lock laddr
    ...

    Linus Torvalds
     
  • We get this right for queue_discard_max_show but not max_hw_show. Follow the
    same pattern as queue_discard_max_show instead so that we don't truncate.

    Signed-off-by: Alan Cox
    Signed-off-by: Jens Axboe

    Alan
     

15 Feb, 2016

1 commit

    Currently q->mq_ops is widely used to decide whether the queue is mq
    or not, so we should set the 'flag' asap so that both the block core
    and drivers can get the correct mq info.

    For example, commit 868f2f0b720 (blk-mq: dynamic h/w context count)
    moves the hctx's initialization before setting q->mq_ops in
    blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
    to think the queue is non-mq and not allocate the command size for
    the per-hctx flush rq.

    This patch should fix the problem reported by Sasha.

    Cc: Keith Busch
    Reported-by: Sasha Levin
    Signed-off-by: Ming Lei
    Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count")
    Signed-off-by: Jens Axboe

    Ming Lei
     

13 Feb, 2016

1 commit

  • Pull SCSI fixes from James Bottomley:
    "A set of seven fixes:

    Two regressions in the new hisi_sas arm driver, a blacklist entry for
    the marvell console which was causing a reset cascade without it, a
    race fix in the WRITE_SAME/DISCARD routines, a retry fix for the rdac
    driver, without which, it would prematurely return EIO and a couple of
    fixes for the hyper-v storvsc driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    block/sd: Return -EREMOTEIO when WRITE SAME and DISCARD are disabled
    SCSI: Add Marvell Console to VPD blacklist
    scsi_dh_rdac: always retry MODE SELECT on command lock violation
    storvsc: Use the specified target ID in device lookup
    storvsc: Install the storvsc specific timeout handler for FC devices
    hisi_sas: fix v1 hw check for slot error
    hisi_sas: add dependency for HAS_IOMEM

    Linus Torvalds
     

12 Feb, 2016

3 commits

    Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check
    for current->mm to see if we have a user space context, and only
    copies data if we do. Now, if an IO gets interrupted by a signal,
    data isn't copied into user space any more (as we don't have a user
    space context), but user space isn't notified about it.

    This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
    to notify userland that a signal has interrupted the syscall, otherwise
    it could lead to a situation where the caller may get a buffer with
    no data returned.

    This can be reproduced by issuing SG_IO ioctl()s in one thread while
    constantly sending signals to it.

    Fixes: 35dc248 ("[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal")
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Hannes Reinecke
    Cc: stable@vger.kernel.org # v.3.11+
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
    Go directly to ending a request if it wasn't started. Previously,
    completing a request could invoke a driver callback for a request
    the driver didn't initialize.

    Signed-off-by: Keith Busch
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Johannes Thumshirn
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • The new queue limit is not used by the majority of block drivers, and
    should be initialized to 0 for the driver's requested settings to be used.

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

11 Feb, 2016

1 commit

  • The new queue limit is not used by the majority of block drivers, and
    should be initialized to 0 for the driver's requested settings to be used.

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Feb, 2016

3 commits

  • The hardware's provided queue count may change at runtime with resource
    provisioning. This patch allows a block driver to alter the number of
    h/w queues available when its resource count changes.

    The main part is a new blk-mq API to request a new number of h/w queues
    for a given live tag set. The new API freezes all queues using that set,
    then adjusts the allocated count prior to remapping these to CPUs.

    The bulk of the rest just shifts where h/w contexts and all their
    artifacts are allocated and freed.

    The maximum number of h/w contexts is capped to the number of
    possible cpus, since there is no use for more than that. As such,
    all pre-allocated memory for pointers needs to account for the max
    possible rather than the initial number of queues.

    A side effect of this is that blk-mq will proceed successfully as
    long as it can allocate at least one h/w context. Previously it
    would fail request queue initialization if fewer than the requested
    number were allocated.

    Signed-off-by: Keith Busch
    Reviewed-by: Christoph Hellwig
    Tested-by: Jon Derrick
    Signed-off-by: Jens Axboe

    Keith Busch
     
    get_disk()/get_gendisk() calls have a non-explicit side effect: they
    increase the reference count on the disk owner module.

    The following is the correct sequence for getting a disk reference
    and putting it:

    disk = get_gendisk(...);

    /* use disk */

    owner = disk->fops->owner;
    put_disk(disk);
    module_put(owner);

    fs/block_dev.c is aware of this required module_put() call, but
    e.g. blkg_conf_finish(), which is located in block/blk-cgroup.c,
    does not put a module reference. To see the leak in action, the
    cgroups throttle config can be used. In the following script I'm
    removing the throttle for /dev/ram0 (actually this is a NOP, because
    the throttle was never set for this device):

    # lsmod | grep brd
    brd 5175 0
    # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
    /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
    done
    # lsmod | grep brd
    brd 5175 100

    Now the brd module has 100 references.

    The issue is fixed by calling module_put() right after put_disk().

    Signed-off-by: Roman Pen
    Cc: Gi-Oh Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe

    Roman Pen
     
    When a process is doing random writes with the O_DSYNC flag, the
    I/O wait is not accounted in the kernel (get_cpu_iowait_time_us).
    This prevents the governor or the cpufreq driver from accounting
    for I/O wait and thus from using the right P-state.

    Signed-off-by: Stephane Gasparini
    Signed-off-by: Philippe Longepe
    Signed-off-by: Jens Axboe

    Stephane Gasparini
     

05 Feb, 2016

6 commits

  • James Bottomley
     
  • When a storage device rejects a WRITE SAME command we will disable write
    same functionality for the device and return -EREMOTEIO to the block
    layer. -EREMOTEIO will in turn prevent DM from retrying the I/O and/or
    failing the path.

    Yiwen Jiang discovered a small race where WRITE SAME requests issued
    simultaneously would cause -EIO to be returned. This happened because
    any requests being prepared after WRITE SAME had been disabled for the
    device caused us to return BLKPREP_KILL. The latter caused the block
    layer to return -EIO upon completion.

    To overcome this we introduce BLKPREP_INVALID which indicates that this
    is an invalid request for the device. blk_peek_request() is modified to
    return -EREMOTEIO in that case.

    Reported-by: Yiwen Jiang
    Suggested-by: Mike Snitzer
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ewan Milne
    Reviewed-by: Yiwen Jiang
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     
    Currently we don't allow the sync workload of one cgroup to preempt
    the sync workload of any other cgroup. This is because we want to
    achieve service separation between cgroups. However, in cases where
    the preempting cgroup is an ancestor of the current cgroup, there is
    no need for separation, and idling introduces unnecessary overhead.
    This hurts, for example, the case where a workload is isolated
    within a cgroup but journalling threads are in the root cgroup. A
    simple way to demonstrate the issue is using:

    dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1

    on ext4 filesystem on plain SATA drive (mounted with barrier=0 to make
    difference more visible). When all processes are in the root cgroup,
    reported throughput is 153.132 MB/sec. When dbench process gets its own
    blkio cgroup, reported throughput drops to 26.1006 MB/sec.

    Fix the problem by making the check in cfq_should_preempt() more
    benevolent and allowing preemption by an ancestor cgroup. This
    improves the throughput reported by dbench4 to 48.9106 MB/sec.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • The original idea with preemption of sync noidle queues (introduced in
    commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was
    that we service all sync noidle queues together, we don't idle on any of
    the queues individually and we idle only if there is no sync noidle
    queue to be served. This intention also matches the original test:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
    && new_cfqq->service_tree == cfqq->service_tree)
    return true;

    However since at that time cfqq->service_tree was not set for idling
    queues, this test was unreliable and was replaced in commit e4a229196a7c
    "cfq-iosched: fix no-idle preemption logic" by:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
    new_cfqq->service_tree->count == 1)
    return true;

    That was a reliable test but was actually doing something different -
    now we preempt sync noidle queue only if the new queue is the only one
    busy in the service tree.

    These days cfq queue is kept in service tree even if it is idling and
    thus the original check would be safe again. But since we actually check
    that cfq queues are in the same cgroup, of the same priority class and
    workload type (sync noidle), we know that new_cfqq is fine to preempt
    cfqq. So just remove the service tree check.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Move check for preemption by rt class up. There is no functional change
    but it makes arguing about conditions simpler since we can be sure both
    cfq queues are from the same ioprio class.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
    There is no point in idling on a cfq group if the only cfq queue
    that is there has too big a thinktime.

    Signed-off-by: Jan Kara
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Feb, 2016

2 commits

  • Pull libnvdimm fixes from Dan Williams:
    "1/ Fixes to the libnvdimm 'pfn' device that establishes a reserved
    area for storing a struct page array.

    2/ Fixes for dax operations on a raw block device to prevent pagecache
    collisions with dax mappings.

    3/ A fix for pfn_t usage in vm_insert_mixed that lead to a null
    pointer de-reference.

    These have received build success notification from the kbuild robot
    across 153 configs and pass the latest ndctl tests"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    phys_to_pfn_t: use phys_addr_t
    mm: fix pfn_t to page conversion in vm_insert_mixed
    block: use DAX for partition table reads
    block: revert runtime dax control of the raw block device
    fs, block: force direct-I/O for dax-enabled block devices
    devm_memremap_pages: fix vmem_altmap lifetime + alignment handling
    libnvdimm, pfn: fix restoring memmap location
    libnvdimm: fix mode determination for e820 devices

    Linus Torvalds
     
  • commit 63de428b139d3d31d86ebe25ae97b33f6540fb7e ("deadline-iosched:
    allow non-sequential batching") removed last use of last_sector.

    Signed-off-by: Tahsin Erdogan
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tahsin Erdogan