05 May, 2016

1 commit

  • commit b30a337ca27c4f40439e4bfb290cba5f88d73bb7 upstream.

    The initialization of the partition's percpu_ref should be done before
    sending out the KOBJ_ADD uevent, which may cause userspace to read the
    partition table. Otherwise the uninitialized percpu_ref may be accessed
    in the data path.

    This patch fixes this issue reported by Naveen.
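    A minimal sketch of the corrected ordering (the field and callback
    names here are assumptions, not the exact upstream code):

        /* init the ref _before_ userspace can be told about the partition */
        err = percpu_ref_init(&p->ref, partition_release, 0, GFP_KERNEL);
        if (err)
            goto out_free;

        /* only now may userspace react and start reading the partition table */
        kobject_uevent(&p->kobj, KOBJ_ADD);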

    Reported-by: Naveen Kaje
    Tested-by: Naveen Kaje
    Fixes: 6c71013ecb7e2 ("block: partition: convert percpu ref")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

13 Apr, 2016

1 commit

  • commit 6acfe68bac7e6f16dc312157b1fa6e2368985013 upstream.

    Request-based DM's blk-mq support (dm-mq) was reported to be 50% slower
    than if an underlying null_blk device were used directly. One of the
    reasons for this drop in performance is that blk_insert_cloned_request()
    was calling blk_mq_insert_request() with @async=true. This forced the
    use of kblockd_schedule_delayed_work_on() to run the blk-mq hw queues
    which ushered in ping-ponging between process context (fio in this case)
    and kblockd's kworker to submit the cloned request. The ftrace
    function_graph tracer showed:

    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...
    kworker-2013 => fio-12190
    fio-12190 => kworker-2013
    ...

    Fixing blk_insert_cloned_request()'s blk_mq_insert_request() call to
    _not_ use kblockd to submit the cloned requests isn't enough to
    eliminate the observed context switches.
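    A hedged sketch of the blk-core part of the fix (the surrounding call
    site in blk_insert_cloned_request() is assumed, not quoted verbatim):

        if (q->mq_ops) {
            /* at_head=false, run_queue=true, async=false: run the hw
             * queue from the caller's context instead of via kblockd */
            blk_mq_insert_request(rq, false, true, false);
            return 0;
        }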

    In addition to this dm-mq specific blk-core fix, there are 2 DM core
    fixes to dm-mq that (when paired with the blk-core fix) completely
    eliminate the observed context switching:

    1) don't blk_mq_run_hw_queues in blk-mq request completion

    Motivated by the desire to reduce the overhead of dm-mq: punting to
    kblockd just increases context switches.

    In my testing against a really fast null_blk device there was no benefit
    to running blk_mq_run_hw_queues() on completion (and no other blk-mq
    driver does this). So hopefully this change doesn't induce the need for
    yet another revert like commit 621739b00e16ca2d !

    2) use blk_mq_complete_request() in dm_complete_request()

    blk_complete_request() doesn't offer the traditional q->mq_ops vs
    .request_fn branching pattern that other historic block interfaces
    do (e.g. blk_get_request). Using blk_mq_complete_request() for
    blk-mq requests is important for performance. It should be noted
    that, like blk_complete_request(), blk_mq_complete_request() doesn't
    natively handle partial completions -- but the request-based
    DM-multipath target does provide the required partial completion
    support by dm.c:end_clone_bio() triggering requeueing of the request
    via dm-mpath.c:multipath_end_io()'s return of DM_ENDIO_REQUEUE.
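    Sketched, the branching that DM now does itself in its completion
    path (error propagation elided; the exact upstream signature of
    blk_mq_complete_request() may differ):

        static void dm_complete_request(struct request *rq, int error)
        {
            if (!rq->q->mq_ops)
                blk_complete_request(rq);
            else
                blk_mq_complete_request(rq);
        }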

    dm-mq fix #2 is _much_ more important than #1 for eliminating the
    context switches.
    Before: cpu : usr=15.10%, sys=59.39%, ctx=7905181, majf=0, minf=475
    After: cpu : usr=20.60%, sys=79.35%, ctx=2008, majf=0, minf=472

    With these changes multithreaded async read IOPs improved from ~950K
    to ~1350K for this dm-mq stacked on null_blk test-case. The raw read
    IOPs of the underlying null_blk device for the same workload is ~1950K.

    Fixes: 7fb4898e0 ("block: add blk-mq support to blk_insert_cloned_request()")
    Fixes: bfebd1cdb ("dm: add full blk-mq support to request-based DM")
    Reported-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

10 Mar, 2016

1 commit

  • commit 5f009d3f8e6685fe8c6215082c1696a08b411220 upstream.

    The new queue limit (max_dev_sectors) is not used by the majority of
    block drivers, and should be initialized to 0 so that the driver's
    requested settings are used.
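    As a sketch of what that means in blk_set_default_limits() (the
    neighbouring assignment is illustrative):

        lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
        lim->max_dev_sectors = 0;    /* unset: min_not_zero() ignores it,
                                      * so the driver's setting wins */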

    Signed-off-by: Keith Busch
    Acked-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

04 Mar, 2016

1 commit

  • commit 2d99b55d378c996b9692a0c93dd25f4ed5d58934 upstream.

    Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check for
    current->mm to see if we have a user-space context, and only copies data
    if we do. Now if an IO gets interrupted by a signal, data isn't copied
    into user space any more (as we don't have a user-space context), but
    user space isn't notified about it.

    This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
    to notify userland that a signal has interrupted the syscall, otherwise
    it could lead to a situation where the caller may get a buffer with
    no data returned.

    This can be reproduced by issuing SG_IO ioctl()s in one thread while
    constantly sending signals to it.
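    The core of the change in bio_uncopy_user(), sketched (the
    surrounding cleanup path is assumed):

        if (!current->mm) {
            /* a signal killed the task: there is no user-space context
             * to copy into, so report it instead of returning 0 */
            ret = -EINTR;
            goto cleanup;
        }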

    Fixes: 35dc248383bb ("[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal")
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Hannes Reinecke
     

18 Feb, 2016

2 commits

  • commit d0e5fbb01a67e400e82fefe4896ea40c6447ab98 upstream.

    After commit e36f62042880 ("block: split bios to max possible length"),
    a bio can be split in the middle of a vector entry, which makes it easy
    to split out a bio whose size isn't aligned with the logical block size,
    especially when the block size is bigger than 512.

    This patch fixes the issue by making the max io size aligned
    to logical block size.
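    The fix boils down to rounding the maximum io size down to the
    logical block size, roughly:

        static inline unsigned get_max_io_size(struct request_queue *q,
                                               struct bio *bio)
        {
            unsigned sectors = blk_max_size_offset(q, bio->bi_iter.bi_sector);
            unsigned mask = queue_logical_block_size(q) >> 9;

            /* align the maximum io size to the logical block size */
            sectors &= ~(mask - 1);
            return sectors;
        }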

    Fixes: e36f62042880 ("block: split bios to max possible length")
    Reported-by: Stefan Haberland
    Cc: Keith Busch
    Suggested-by: Linus Torvalds
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit e36f6204288088fda50d1c84830340ccb70f85ff upstream.

    This splits bio in the middle of a vector to form the largest possible
    bio at the h/w's desired alignment, and guarantees the bio being split
    will have some data.

    The criteria for splitting is changed from the max sectors to the h/w's
    optimal sector alignment if it is provided. For h/w that advertise their
    block storage's underlying chunk size, it's a big performance win to not
    submit commands that cross them. If sector alignment is not provided,
    this patch uses the max sectors as before.
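    The split criterion can be sketched as: stay within max_sectors, but
    never cross an advertised chunk boundary (shape assumed):

        static inline unsigned int blk_max_size_offset(struct request_queue *q,
                                                       sector_t offset)
        {
            if (!q->limits.chunk_sectors)
                return q->limits.max_sectors;

            /* sectors left until the next chunk boundary */
            return min(q->limits.max_sectors, (unsigned int)
                       (q->limits.chunk_sectors -
                        (offset & (q->limits.chunk_sectors - 1))));
        }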

    This addresses the performance issue that commit d380561113 attempted to
    fix, but which was reverted due to a splitting logic error.

    Signed-off-by: Keith Busch
    Cc: Jens Axboe
    Cc: Ming Lei
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

09 Jan, 2016

1 commit

  • This reverts commit d3805611130af9b911e908af9f67a3f64f4f0914.

    If we end up splitting on the first segment, we don't adjust the sector
    count. That results in hitting a BUG() when attempting to split 0
    sectors.

    As this is just a performance issue and not a regression since the 4.3
    release, let's just revert this change. That gives us more time to test
    a real fix for 4.5, which would be marked for stable anyway.

    Jens Axboe
     

23 Dec, 2015

3 commits

  • For h/w that advertise their block storage's underlying chunk size, it's
    a big performance win to not submit commands that cross them. This patch
    uses that criteria if it is provided. If it is not provided, this patch
    uses the max sectors as before.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Pull block layer fixes from Jens Axboe:
    "Three small fixes for 4.4 final. Specifically:

    - The segment issue fix from Junichi, where the old IO path does a
    bio limit split before potentially bouncing the pages. We need to
    do that in the right order, to ensure that limitations are met.

    - A NVMe surprise removal IO hang fix from Keith.

    - A use-after-free in null_blk, introduced by a previous patch in
    this series. From Mike Krinkin"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: fix use-after-free error
    block: ensure to split after potentially bouncing a bio
    NVMe: IO ending fixes on surprise removal

    Linus Torvalds
     
    blk_queue_bio() does the split and then the bounce, which makes the
    segment counting happen on the pages before bouncing, so it can go
    wrong. Move the split to after the bouncing, like we do for blk-mq;
    that fixes the issue of the bio's segment count being wrong.
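    After the fix the ordering in blk_queue_bio() matches blk-mq,
    roughly:

        blk_queue_bounce(q, &bio);               /* may substitute pages */
        blk_queue_split(q, &bio, q->bio_split);  /* count segments after */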

    Fixes: 54efd50bfd87 ("block: make generic_make_request handle arbitrarily sized bios")
    Cc: stable@vger.kernel.org
    Tested-by: Artem S. Tashkinov
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

13 Dec, 2015

1 commit

  • Pull block layer fixes from Jens Axboe:
    "A set of fixes for the current series. This contains:

    - A bunch of fixes for lightnvm, should be the last round for this
    series. From Matias and Wenwei.

    - A writeback detach inode fix from Ilya, also marked for stable.

    - A block (though it says SCSI) fix for an OOPS in SCSI runtime power
    management.

    - Module init error path fixes for null_blk from Minfei"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    null_blk: Fix error path in module initialization
    lightnvm: do not compile in debugging by default
    lightnvm: prevent gennvm module unload on use
    lightnvm: fix media mgr registration
    lightnvm: replace req queue with nvmdev for lld
    lightnvm: comments on constants
    lightnvm: check mm before use
    lightnvm: refactor spin_unlock in gennvm_get_blk
    lightnvm: put blks when luns configure failed
    lightnvm: use flags in rrpc_get_blk
    block: detach bdev inode from its wb in __blkdev_put()
    SCSI: Fix NULL pointer dereference in runtime PM

    Linus Torvalds
     

07 Dec, 2015

2 commits

  • The following commit which went into mainline through networking tree

    3b13758f51de ("cgroups: Allow dynamically changing net_classid")

    conflicts in net/core/netclassid_cgroup.c with the following pending
    fix in cgroup/for-4.4-fixes.

    1f7dd3e5a6e4 ("cgroup: fix handling of multi-destination migration from subtree_control enabling")

    The former separates out update_classid() from cgrp_attach() and
    updates it to walk all fds of all tasks in the target css so that it
    can be used from both migration and config change paths. The latter
    drops @css from cgrp_attach().

    Resolve the conflict by making cgrp_attach() call update_classid()
    with the css from the first task. We can revive @tset walking in
    cgrp_attach() but given that net_cls is v1 only where there always is
    only one target css during migration, this is fine.

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Cc: Nina Schiff

    Tejun Heo
     
  • Pull SCSI fixes from James Bottomley:
    "This is quite a bumper crop of fixes: three from Arnd correcting
    various build issues in some configurations, a lock recursion in
    qla2xxx. Two potentially exploitable issues in hpsa and mvsas, a
    potential null deref in st, a revert of a bdi registration fix that
    turned out to cause even more problems, a set of fixes to allow people
    who only defined MPT2SAS to still work after the mpt2/mpt3sas merger
    and a couple of fixes for issues turned up by the hyper-v storvsc
    driver"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    mpt3sas: fix Kconfig dependency problem for mpt2sas back compatibility
    Revert "scsi: Fix a bdi reregistration race"
    mpt3sas: Add dummy Kconfig option for backwards compatibility
    Fix a memory leak in scsi_host_dev_release()
    block/sd: Fix device-imposed transfer length limits
    scsi_debug: fix prevent_allow+verify regressions
    MAINTAINERS: Add myself as co-maintainer of the SCSI subsystem.
    sd: Make discard granularity match logical block size when LBPRZ=1
    scsi: hpsa: select CONFIG_SCSI_SAS_ATTR
    scsi: advansys needs ISA dma api for ISA support
    scsi_sysfs: protect against double execution of __scsi_remove_device()
    st: fix potential null pointer dereference.
    scsi: report 'INQUIRY result too short' once per host
    advansys: fix big-endian builds
    qla2xxx: Fix rwlock recursion
    hpsa: logical vs bitwise AND typo
    mvsas: don't allow negative timeouts
    mpt3sas: Fix use sas_is_tlr_enabled API before enabling MPI2_SCSIIO_CONTROL_TLR_ON flag

    Linus Torvalds
     

04 Dec, 2015

1 commit

  • The routines in scsi_pm.c assume that if a runtime-PM callback is
    invoked for a SCSI device, it can only mean that the device's driver
    has asked the block layer to handle the runtime power management (by
    calling blk_pm_runtime_init(), which among other things sets q->dev).

    However, this assumption turns out to be wrong for things like the ses
    driver. Normally ses devices are not allowed to do runtime PM, but
    userspace can override this setting. If this happens, the kernel gets
    a NULL pointer dereference when blk_post_runtime_resume() tries to use
    the uninitialized q->dev pointer.

    This patch fixes the problem by checking q->dev in the block layer
    before handling runtime PM. Since ses doesn't define any PM callbacks
    or call blk_pm_runtime_init(), the crash no longer occurs.
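    The shape of the check added to the blk_*_runtime_*() helpers
    (sketch; the rest of the body is elided):

        void blk_post_runtime_resume(struct request_queue *q, int err)
        {
            if (!q->dev)   /* driver never called blk_pm_runtime_init() */
                return;
            /* ... normal runtime-resume handling ... */
        }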

    This fixes Bugzilla #101371.
    https://bugzilla.kernel.org/show_bug.cgi?id=101371

    More discussion can be found at the link below.
    http://marc.info/?l=linux-scsi&m=144163730531875&w=2

    Signed-off-by: Ken Xue
    Acked-by: Alan Stern
    Cc: Xiangliang Yu
    Cc: James E.J. Bottomley
    Cc: Jens Axboe
    Cc: Michael Terry
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Ken Xue
     

03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B, and A's processes should be moved to
    the former and B's processes to the latter. IOW, enabling controllers
    can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
    by updating the cgroup_taskset iteration helpers to also return the
    destination css in addition to the task being migrated. All controllers
    are updated accordingly; see the sketch after the list below.

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's single source
    and destination and thus doesn't support v2 hierarchy already. The
    only change made by this patchset is how that single destination css
    is obtained.

    * memory migration path already doesn't do anything on v2. How the
    single destination css is obtained is updated and the prep stage of
    mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.
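    The updated iteration pattern in a controller's migration method
    looks roughly like this (the charge helpers are illustrative):

        struct task_struct *task;
        struct cgroup_subsys_state *dst_css;

        cgroup_taskset_for_each(task, dst_css, tset) {
            /* charge against the css each task actually migrates into,
             * not the css whose subtree_control was changed */
            pids_charge(css_pids(dst_css), 1);
        }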

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

30 Nov, 2015

1 commit

  • When a cloned request is retried on other queues it always needs
    to be checked against the queue limits of that queue.
    Otherwise the calculations for nr_phys_segments might be wrong,
    leading to a crash in scsi_init_sgtable().

    To clarify this, the patch renames blk_rq_check_limits()
    to blk_cloned_rq_check_limits() and removes the symbol
    export, as the new function should only be used for
    cloned requests and never exported.

    Cc: Mike Snitzer
    Cc: Ewan Milne
    Cc: Jeff Moyer
    Signed-off-by: Hannes Reinecke
    Fixes: e2a60da74 ("block: Clean up special command handling logic")
    Cc: stable@vger.kernel.org # 3.7+
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

26 Nov, 2015

3 commits

  • Today, blockdev --rereadpt /dev/sda will fail with EBUSY if any
    partition of sda is mounted (and will fail with EINVAL if pointed
    at a partition). But it will pass if the entire block device is
    formatted with a filesystem and mounted. I don't think this makes
    sense; partitioning should surely not ever change out from under
    a mounted device.

    So check for bdev->bd_super, and fail that with -EBUSY as well.
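    Sketched, the extra check in blkdev_reread_part() (exact context
    assumed):

        if (bdev->bd_part_count || bdev->bd_super)
            return -EBUSY;  /* partitions in use, or device mounted */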

    Signed-off-by: Eric Sandeen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Eric Sandeen
     
  • Commit 4f258a46346c ("sd: Fix maximum I/O size for BLOCK_PC requests")
    had the unfortunate side-effect of removing an implicit clamp to
    BLK_DEF_MAX_SECTORS for REQ_TYPE_FS requests in the block layer
    code. This caused problems for some SMR drives.

    Debugging this issue revealed a few problems with the existing
    infrastructure since the block layer didn't know how to deal with
    device-imposed limits, only limits set by the I/O controller.

    - Introduce a new queue limit, max_dev_sectors, which is used by the
    ULD to signal the maximum sectors for a REQ_TYPE_FS request.

    - Ensure that max_dev_sectors is correctly stacked and taken into
    account when overriding max_sectors through sysfs (see the sketch
    after this list).

    - Rework sd_read_block_limits() so it saves the max_xfer and opt_xfer
    values for later processing.

    - In sd_revalidate() set the queue's max_dev_sectors based on the
    MAXIMUM TRANSFER LENGTH value in the Block Limits VPD. If this value
    is not reported, fall back to a cap based on the CDB TRANSFER LENGTH
    field size.

    - In sd_revalidate(), use OPTIMAL TRANSFER LENGTH from the Block Limits
    VPD--if reported and sane--to signal the preferred device transfer
    size for FS requests. Otherwise use BLK_DEF_MAX_SECTORS.

    - blk_limits_max_hw_sectors() is no longer used and can be removed.
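    The stacking mentioned above can be sketched as it might appear in
    blk_queue_max_hw_sectors() (details assumed):

        limits->max_hw_sectors = max_hw_sectors;
        max_sectors = min_not_zero(max_hw_sectors, limits->max_dev_sectors);
        max_sectors = min_t(unsigned int, max_sectors, BLK_DEF_MAX_SECTORS);
        limits->max_sectors = max_sectors;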

    Signed-off-by: Martin K. Petersen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=93581
    Reviewed-by: Christoph Hellwig
    Tested-by: sweeneygj@gmx.com
    Tested-by: Arzeets
    Tested-by: David Eisner
    Tested-by: Mario Kicherer
    Signed-off-by: Martin K. Petersen

    Martin K. Petersen
     
  • This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e.

    Jan writes:

    --

    Thanks for report! After some investigation I found out we allocate
    elevator specific data in __get_request() only for non-flush requests. And
    this is actually required since the flush machinery uses the space in
    struct request for something else. Doh. So my patch is just wrong and not
    easy to fix since at the time __get_request() is called we are not sure
    whether the flush machinery will be used in the end. Jens, please revert
    1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!

    I'm somewhat surprised that you can reliably hit the race where flushing
    gets disabled for the device just while the request is in flight. But I
    guess during boot it makes some sense.

    --

    So let's just revert it, we can fix the queue run manually after the
    fact. This race is rare enough that it didn't trigger in testing, it
    requires the specific disable-while-in-flight scenario to trigger.

    Jens Axboe
     

25 Nov, 2015

2 commits

  • We only added the request to the request list for the !blk-mq case,
    so we should only delete it in that case as well.
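    The shape of the fix (context assumed):

        if (!q->mq_ops)
            /* only the legacy path ever added the request to the list */
            list_del_init(&rq->queuelist);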

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Pull block layer fixes from Jens Axboe:
    "A round of fixes/updates for the current series.

    This looks a little bigger than it is, but that's mainly because we
    pushed the lightnvm enabled null_blk change out of the merge window so
    it could be updated a bit. The rest of the volume is also mostly
    lightnvm. In particular:

    - Lightnvm. Various fixes, additions, updates from Matias and
    Javier, as well as from Wenwei Tao.

    - NVMe:
    - Fix for potential arithmetic overflow from Keith.
    - Also from Keith, ensure that we reap pending completions from
    a completion queue before deleting it. Fixes kernel crashes
    when resetting a device with IO pending.
    - Various little lightnvm related tweaks from Matias.

    - Fixup flushes to go through the IO scheduler, for the cases where a
    flush is not required. Fixes a case in CFQ where we would be
    idling and not see this request, hence not break the idling. From
    Jan Kara.

    - Use list_{first,prev,next} in elevator.c for cleaner code. From
    Geliang Tang.

    - Fix for a warning trigger on btrfs and raid on single queue blk-mq
    devices, where we would flush plug callbacks with preemption
    disabled. From me.

    - A mac partition validation fix from Kees Cook.

    - Two merge fixes from Ming, marked stable. A third part is adding a
    new warning so we'll notice this quicker in the future, if we screw
    up the accounting.

    - Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"

    * 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
    blk-merge: warn if figured out segment number is bigger than nr_phys_segments
    blk-merge: fix blk_bio_segment_split
    block: fix segment split
    blk-mq: fix calling unplug callbacks with preempt disabled
    mac: validate mac_partition is within sector
    mtip32xx: use formatting capability of kthread_create_on_node
    NVMe: reap completion entries when deleting queue
    lightnvm: add free and bad lun info to show luns
    lightnvm: keep track of block counts
    nvme: lightnvm: use admin queues for admin cmds
    lightnvm: missing free on init error
    lightnvm: wrong return value and redundant free
    null_blk: do not del gendisk with lightnvm
    null_blk: use device addressing mode
    null_blk: use ppa_cache pool
    NVMe: Fix possible arithmetic overflow for max segments
    blk-flush: Queue through IO scheduler when flush not required
    null_blk: register as a LightNVM device
    elevator: use list_{first,prev,next}_entry
    lightnvm: cleanup queue before target removal
    ...

    Linus Torvalds
     

24 Nov, 2015

3 commits

    We have seen lots of reports of this kind of issue, so add a warning
    in blk-merge so that it can be triggered easily, instead of depending
    on a warning/BUG from drivers.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Commit bdced438acd83a ("block: setup bi_phys_segments after splitting")
    introduced the computation of bio->bi_phys_segments during bio
    splitting.

    Unfortunately neither bio->bi_seg_front_size nor bio->bi_seg_back_size
    is computed, so too many physical segments may be obtained for one
    request, since both values are used to check whether one segment can
    span two bios.

    This patch fixes the issue by computing the two variables in
    blk_bio_segment_split().

    Fixes: bdced438acd83a ("block: setup bi_phys_segments after splitting")
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
    Inside blk_bio_segment_split(), the previous-bvec pointer (bvprvp)
    always points to the iterator-local variable, which is obviously wrong,
    so fix it by pointing it at the local variable 'bvprv'.
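    The two-line shape of the fix inside the bvec loop:

        bvprv = bv;
        bvprvp = &bvprv;   /* was &bv, the iterator-local copy that gets
                            * overwritten on the next loop iteration */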

    Fixes: 5014c311baa2b ("block: fix bogus compiler warnings in blk-merge.c")
    Cc: stable@kernel.org #4.3
    Reported-by: Michael Ellerman
    Reported-by: Mark Salter
    Tested-by: Laurent Dufour
    Tested-by: Mark Salter
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

21 Nov, 2015

1 commit

  • Liu reported that running certain parts of xfstests threw the
    following error:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3190
    in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0
    3 locks held by kworker/u16:0/6:
    #0: ("writeback"){++++.+}, at: [] process_one_work+0x173/0x730
    #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [] process_one_work+0x173/0x730
    #2: (&type->s_umount_key#44){+++++.}, at: [] trylock_super+0x25/0x60
    CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3
    Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
    Workqueue: writeback wb_workfn (flush-btrfs-108)
    ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab
    0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8
    ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76
    Call Trace:
    [] dump_stack+0x4f/0x74
    [] ___might_sleep+0x185/0x240
    [] __might_sleep+0x52/0x90
    [] __alloc_pages_nodemask+0x268/0x410
    [] ? sched_clock_local+0x1c/0x90
    [] ? local_clock+0x21/0x40
    [] ? __lock_release+0x420/0x510
    [] ? __lock_acquired+0x16c/0x3c0
    [] alloc_pages_current+0xc5/0x210
    [] ? rbio_is_full+0x55/0x70 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] ? _raw_spin_unlock_irqrestore+0x40/0x60
    [] full_stripe_write+0x5a/0xc0 [btrfs]
    [] __raid56_parity_write+0x39/0x60 [btrfs]
    [] run_plug+0x11b/0x140 [btrfs]
    [] btrfs_raid_unplug+0x23/0x70 [btrfs]
    [] blk_flush_plug_list+0x82/0x1f0
    [] blk_sq_make_request+0x1f9/0x740
    [] ? generic_make_request_checks+0x222/0x7c0
    [] ? blk_queue_enter+0x124/0x310
    [] ? blk_queue_enter+0x92/0x310
    [] generic_make_request+0x172/0x2c0
    [] ? generic_make_request+0x164/0x2c0
    [] submit_bio+0x70/0x140
    [] ? rbio_add_io_page+0x99/0x150 [btrfs]
    [] finish_rmw+0x4d9/0x600 [btrfs]
    [] full_stripe_write+0x9c/0xc0 [btrfs]
    [] raid56_parity_write+0xef/0x160 [btrfs]
    [] btrfs_map_bio+0xe3/0x2d0 [btrfs]
    [] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs]
    [] submit_one_bio+0x74/0xb0 [btrfs]
    [] submit_extent_page+0xe5/0x1c0 [btrfs]
    [] __extent_writepage_io+0x408/0x4c0 [btrfs]
    [] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs]
    [] __extent_writepage+0x218/0x3a0 [btrfs]
    [] ? mark_held_locks+0x78/0xa0
    [] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs]
    [] extent_writepages+0x52/0x70 [btrfs]
    [] ? btrfs_set_inode_index+0x70/0x70 [btrfs]
    [] btrfs_writepages+0x27/0x30 [btrfs]
    [] do_writepages+0x23/0x40
    [] __writeback_single_inode+0x89/0x4d0
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x260/0x480
    [] ? writeback_sb_inodes+0x15f/0x480
    [] writeback_sb_inodes+0x2d2/0x480
    [] ? down_read_trylock+0x57/0x60
    [] ? trylock_super+0x25/0x60
    [] ? rcu_read_lock_sched_held+0x4f/0x90
    [] __writeback_inodes_wb+0x8c/0xc0
    [] wb_writeback+0x2b5/0x500
    [] ? mark_held_locks+0x78/0xa0
    [] ? __local_bh_enable_ip+0x68/0xc0
    [] ? wb_do_writeback+0x62/0x310
    [] wb_do_writeback+0xc1/0x310
    [] ? set_worker_desc+0x79/0x90
    [] wb_workfn+0x92/0x330
    [] process_one_work+0x223/0x730
    [] ? process_one_work+0x173/0x730
    [] ? worker_thread+0x18f/0x430
    [] worker_thread+0x11d/0x430
    [] ? maybe_create_worker+0xf0/0xf0
    [] ? maybe_create_worker+0xf0/0xf0
    [] kthread+0xef/0x110
    [] ? schedule_tail+0x1e/0xd0
    [] ? __init_kthread_worker+0x70/0x70
    [] ret_from_fork+0x3f/0x70
    [] ? __init_kthread_worker+0x70/0x70

    The issue is that we've got the software context pinned while
    calling blk_flush_plug_list(), which flushes callbacks that
    are allowed to sleep. btrfs and raid have such callbacks.

    Flip the checks around a bit, so we can enable preempt a bit
    earlier and flush plugs without having preempt disabled.

    This only affects blk-mq driven devices, and only those that
    register a single queue.
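    Sketched against blk_sq_make_request() (the call-site details are
    assumptions):

        /* drop the software-queue context first: this re-enables
         * preemption, so plug callbacks are then free to sleep */
        blk_mq_put_ctx(data.ctx);
        blk_flush_plug_list(plug, false);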

    Reported-by: Liu Bo
    Tested-by: Liu Bo
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Nov, 2015

2 commits

  • If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single
    512 byte sector would be read (secsize / 512). However the partition
    structure would be located past the end of the buffer (secsize % 512).
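    The validation added by the fix, roughly (exact code assumed):

        secsize = be16_to_cpu(md->block_size);
        datasize = round_down(secsize, 512);
        data = read_part_sector(state, datasize / 512, &sect);
        if (!data)
            return -1;
        partoffset = secsize % 512;
        if (partoffset + sizeof(*part) > datasize)
            return -1;      /* partition struct would overrun the buffer */
        part = (struct mac_partition *)(data + partoffset);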

    Signed-off-by: Kees Cook
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Kees Cook
     
  • Fix use after free crashes like the following:

    general protection fault: 0000 [#1] SMP
    Call Trace:
    [] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem]
    [] pmem_rw_page+0x42/0x80 [nd_pmem]
    [] bdev_read_page+0x50/0x60
    [] do_mpage_readpage+0x510/0x770
    [] ? I_BDEV+0x20/0x20
    [] ? lru_cache_add+0x1c/0x50
    [] mpage_readpages+0x107/0x170
    [] ? I_BDEV+0x20/0x20
    [] ? I_BDEV+0x20/0x20
    [] blkdev_readpages+0x1d/0x20
    [] __do_page_cache_readahead+0x28f/0x310
    [] ? __do_page_cache_readahead+0x169/0x310
    [] ? pagecache_get_page+0x2d/0x1d0
    [] filemap_fault+0x396/0x530
    [] __do_fault+0x4e/0xf0
    [] handle_mm_fault+0x11bd/0x1b50

    Cc:
    Cc: Jens Axboe
    Cc: Alexander Viro
    Reported-by: kbuild test robot
    Acked-by: Matthew Wilcox
    [willy: symmetry fixups]
    Signed-off-by: Dan Williams

    Dan Williams
     

17 Nov, 2015

2 commits

    Currently blk_insert_flush() just adds the flush request to
    q->queue_head when no flush is required. That completely bypasses the
    IO scheduler, so e.g. CFQ can be idling waiting for a new request to
    arrive and will idle through the whole window unnecessarily. Luckily
    this only happens in rare cases, as usually the checks in
    generic_make_request_checks() clear the FLUSH and FUA flags early if
    they are not needed.

    When no flushing is actually required, we can easily fix the problem by
    properly queueing the request through the IO scheduler. Ideally the IO
    scheduler should also be made aware of requests queued via
    blk_flush_queue_rq(). However, inserting a flush request through the IO
    scheduler can have unwanted side-effects, since due to flush batching,
    delaying the flush request in the IO scheduler will delay all flush
    requests possibly coming from other processes. So we keep adding those
    requests directly to q->queue_head.

    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • To make the intention clearer, use list_{first,prev,next}_entry
    instead of list_entry.
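    For example:

        /* before */
        rq = list_entry(q->queue_head.next, struct request, queuelist);

        /* after: the intention is explicit */
        rq = list_first_entry(&q->queue_head, struct request, queuelist);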

    Signed-off-by: Geliang Tang
    Signed-off-by: Jens Axboe

    Geliang Tang
     

11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. And we'll add libaio support as well soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
     

08 Nov, 2015

3 commits

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • Return a cookie, blk_qc_t, from the blk-mq make request functions, that
    allows a later caller to uniquely identify a specific IO. The cookie
    doesn't mean anything to the caller, but the caller can use it to later
    pass back to the block layer. The block layer can then identify the
    hardware queue and request from that cookie.
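    The cookie encoding can be sketched as (roughly as in blk_types.h):

        #define BLK_QC_T_NONE   -1U
        #define BLK_QC_T_SHIFT  16

        static inline blk_qc_t blk_tag_to_qc_t(unsigned int tag,
                                               unsigned int queue_num)
        {
            return tag | (queue_num << BLK_QC_T_SHIFT);
        }

        static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
        {
            return cookie >> BLK_QC_T_SHIFT;
        }

        static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
        {
            return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
        }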

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • No functional changes in this patch, but it prepares us for returning
    a more useful cookie related to the IO that was queued up.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     

07 Nov, 2015

2 commits

  • setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of
    the current pid namespace. This is in contrast to both the other modes of
    setpriority and the example of kill(-1). Fix this. getpriority and
    ioprio have the same failure mode, fix them too.

    Eric said:

    : After some more thinking about it this patch sounds justifiable.
    :
    : My goal with namespaces is not to build perfect isolation mechanisms
    : as that can get into ill defined territory, but to build well defined
    : mechanisms. And to handle the corner cases so you can use only
    : a single namespace with well defined results.
    :
    : In this case you have found the two interfaces I am aware of that
    : identify processes by uid instead of by pid. Which quite frankly is
    : weird. Unfortunately the weird unexpected cases are hard to handle
    : in the usual way.
    :
    : I was hoping for a little more information. Changes like this one we
    : have to be careful of because someone might be depending on the current
    : behavior. I don't think they are and I do think this make sense as part
    : of the pid namespace.
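    The PRIO_USER loop after the fix, sketched: task_pid_vnr() returns 0
    for tasks outside the caller's pid namespace, so they are skipped.

        do_each_thread(g, p) {
            if (uid_eq(task_uid(p), uid) && task_pid_vnr(p))
                error = set_one_prio(p, niceval, error);
        } while_each_thread(g, p);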

    Signed-off-by: Ben Segall
    Cc: Oleg Nesterov
    Cc: Al Viro
    Cc: Ambrose Feinstein
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Segall
     
  • __GFP_WAIT was used to signal that the caller was in atomic context and
    could not sleep. Now it is possible to distinguish between true atomic
    context and callers that are not willing to sleep. The latter should
    clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity;
    clearing it prevents it.
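    After the rename, roughly as in gfp.h:

        /* allow both direct reclaim and waking kswapd */
        #define __GFP_RECLAIM \
            ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))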

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman