16 Jan, 2015

6 commits

  • commit 5fabcb4c33fe11c7e3afdf805fde26c1a54d0953 upstream.

    We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition()
    with a user-supplied partno value. If we pass in 0x7fffffff, the
    new target in disk_expand_part_tbl() overflows the 'int' and we
    access beyond the end of ptbl->part[] and even write to it when we
    do the rcu_assign_pointer() to assign the new partition.

    Reported-by: David Ramos
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
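
    To make the arithmetic concrete, here is a minimal stand-alone C sketch of
    the overflow and the kind of bounds check that prevents it. The MAX_PARTS
    constant and the expand_part_tbl() helper are illustrative stand-ins, not
    the actual disk_expand_part_tbl()/disk_max_parts() code.

      #include <errno.h>
      #include <limits.h>
      #include <stdio.h>

      #define MAX_PARTS 256   /* stand-in for the disk_max_parts() bound */

      static int expand_part_tbl(int partno)
      {
              /*
               * Validate before doing arithmetic: with partno == INT_MAX the
               * unchecked 'partno + 1' wraps to INT_MIN, which then indexes
               * far outside ptbl->part[].
               */
              if (partno < 0 || partno >= MAX_PARTS)
                      return -EINVAL;
              return partno + 1;      /* new table size, safely below INT_MAX */
      }

      int main(void)
      {
              printf("%d\n", expand_part_tbl(INT_MAX));  /* rejected, no wrap */
              printf("%d\n", expand_part_tbl(3));        /* 4 */
              return 0;
      }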
     
  • commit 06a41a99d13d8e919e9a00a4849e6b85ae492592 upstream.

    When a CPU is hotplugged, the current blk-mq spews a warning like:

    kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
    CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
    0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
    ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
    ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
    Call Trace:
    [] dump_trace+0x86/0x330
    [] show_stack_log_lvl+0x94/0x170
    [] show_stack+0x21/0x50
    [] dump_stack+0x41/0x51
    [] kobject_add+0xa0/0xb0
    [] blk_mq_register_hctx+0x91/0xb0
    [] blk_mq_sysfs_register+0x3e/0x60
    [] blk_mq_queue_reinit_notify+0xf8/0x190
    [] notifier_call_chain+0x4c/0x70
    [] cpu_notify+0x23/0x50
    [] _cpu_up+0x157/0x170
    [] cpu_up+0x89/0xb0
    [] cpu_subsys_online+0x35/0x80
    [] device_online+0x5d/0xa0
    [] online_store+0x75/0x80
    [] kernfs_fop_write+0xda/0x150
    [] vfs_write+0xb2/0x1f0
    [] SyS_write+0x42/0xb0
    [] system_call_fastpath+0x16/0x1b
    [] 0x7f0132fb24e0

    This is indeed because of an uninitialized kobject for blk_mq_ctx.
    The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it
    loops over hctx_for_each_ctx(), i.e. it initializes them only for
    online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly
    onlined CPU is registered without initialization.

    This patch fixes the issue by initializing all ctx kobjects belonging
    to each queue.

    Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794
    Signed-off-by: Takashi Iwai
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Takashi Iwai
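
    A sketch of the "initialize every ctx up front" pattern described above;
    the helper name and the q->queue_ctx / blk_mq_ctx_ktype references are
    assumptions based on the 3.18-era blk-mq code and may not match the
    actual patch line for line.

      static void sketch_init_all_ctx_kobjects(struct request_queue *q)
      {
              struct blk_mq_ctx *ctx;
              int cpu;

              /*
               * Walk every possible CPU's ctx, not just those reachable via
               * online hardware contexts, so that a later CPU hotplug finds
               * an already-initialized kobject rather than a zeroed one.
               */
              for_each_possible_cpu(cpu) {
                      ctx = per_cpu_ptr(q->queue_ctx, cpu);
                      kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
              }
      }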
     
  • commit c38d185d4af12e8be63ca4b6745d99449c450f12 upstream.

    What we need is the following two guarantees:
    * Any thread that observes the effect of the test_and_set_bit() by
    __bt_get_word() also observes the preceding addition of 'current'
    to the appropriate wait list. This is guaranteed by the semantics
    of the spin_unlock() operation performed by prepare_to_wait().
    Hence the conversion of test_and_set_bit_lock() into
    test_and_set_bit().
    * The wait lists are examined by bt_clear() after the tag bit has
    been cleared. clear_bit_unlock() guarantees that any thread that
    observes that the bit has been cleared also observes the store
    operations preceding clear_bit_unlock(). However,
    clear_bit_unlock() does not prevent the wait lists from being examined
    before the tag bit is cleared. Hence the addition of a memory
    barrier between clear_bit() and the wait list examination.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
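
    The second bullet is the classic "clear the resource, then check for
    waiters" pattern. A hedged sketch of that ordering (the function and
    parameter names are illustrative, not the exact blk-mq tag code):

      static void sketch_clear_tag(unsigned long *word, unsigned int bit,
                                   wait_queue_head_t *wq)
      {
              clear_bit(bit, word);
              /*
               * Pairs with the barrier implied by prepare_to_wait() on the
               * waiter side: the cleared bit must be visible before we look
               * at the wait queue, so a sleeping waiter cannot be missed.
               */
              smp_mb__after_atomic();
              if (waitqueue_active(wq))
                      wake_up(wq);
      }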
     
  • commit 9e98e9d7cf6e9d2ec1cce45e8d5ccaf3f9b386f3 upstream.

    If __bt_get_word() is called with last_tag != 0, and if the first
    find_next_zero_bit() fails, and if after wrap-around the
    test_and_set_bit() call fails and find_next_zero_bit() succeeds,
    and if the next test_and_set_bit() call fails and subsequently
    find_next_zero_bit() does not find a zero bit, then another
    wrap-around will occur. Avoid this by introducing an additional
    local variable.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
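
    A simplified sketch of the single-wrap guard; the names are illustrative
    and the real __bt_get_word() operates on the blk-mq tag map words rather
    than a bare bitmap.

      static int sketch_get_bit(unsigned long *word, unsigned int depth,
                                unsigned int last_bit)
      {
              unsigned int bit = last_bit;
              bool wrapped = false;   /* the extra local variable */

              for (;;) {
                      bit = find_next_zero_bit(word, depth, bit);
                      if (bit >= depth) {
                              if (wrapped)
                                      return -1;      /* give up after one wrap */
                              wrapped = true;
                              bit = 0;
                              continue;
                      }
                      if (!test_and_set_bit(bit, word))
                              return bit;             /* tag acquired */
                      bit++;                          /* lost the race, move on */
              }
      }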
     
  • commit 45a9c9d909b24c6ad0e28a7946e7486e73010319 upstream.

    blk-mq users are allowed to free the memory request_queue.tag_set
    points at after blk_cleanup_queue() has finished but before
    blk_release_queue() has started. This can happen e.g. in the SCSI
    core: the SCSI core embeds the tag_set structure in a SCSI host
    structure, and the SCSI host structure is freed by
    scsi_host_dev_release(), which is called after blk_cleanup_queue()
    has finished but can be called before blk_release_queue().

    This means that it is not safe to access request_queue.tag_set from
    inside blk_release_queue(). Hence remove the blk_sync_queue() call
    from blk_release_queue(). This call is not necessary - outstanding
    requests must have finished before blk_release_queue() is
    called. Additionally, move the blk_mq_free_queue() call from
    blk_release_queue() to blk_cleanup_queue() so that struct
    request_queue.tag_set is not accessed after it has been freed.

    This patch avoids that the following kernel oops can be triggered
    when deleting a SCSI host for which scsi-mq was enabled:

    Call Trace:
    [] lock_acquire+0xc4/0x270
    [] mutex_lock_nested+0x61/0x380
    [] blk_mq_free_queue+0x30/0x180
    [] blk_release_queue+0x84/0xd0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] blk_put_queue+0x15/0x20
    [] disk_release+0x99/0xd0
    [] device_release+0x36/0xb0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] put_disk+0x1a/0x20
    [] __blkdev_put+0x135/0x1b0
    [] blkdev_put+0x50/0x160
    [] kill_block_super+0x44/0x70
    [] deactivate_locked_super+0x44/0x60
    [] deactivate_super+0x4e/0x70
    [] cleanup_mnt+0x43/0x90
    [] __cleanup_mnt+0x12/0x20
    [] task_work_run+0xac/0xe0
    [] do_notify_resume+0x61/0xa0
    [] int_signal+0x12/0x17

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • commit a33c1ba2913802b6fb23e974bb2f6a4e73c8b7ce upstream.

    We currently use num_possible_cpus(), but that breaks on sparc64 where
    the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID
    instead, so we don't end up reading from invalid memory.

    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
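
    The sizing rule in code form; kcalloc() and the CPU iterators are the
    real kernel APIs, while 'map' and lookup_ctx_for_cpu() are illustrative.

      /* Wrong on sparc64: the highest CPU ID can be >= num_possible_cpus(). */
      map = kcalloc(num_possible_cpus(), sizeof(*map), GFP_KERNEL);

      /* Right: size by nr_cpu_ids, the highest possible CPU ID plus one. */
      map = kcalloc(nr_cpu_ids, sizeof(*map), GFP_KERNEL);
      if (!map)
              return -ENOMEM;

      for_each_possible_cpu(cpu)
              map[cpu] = lookup_ctx_for_cpu(cpu);     /* cpu may be sparse */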
     

02 Dec, 2014

1 commit

  • bio integrity handling is broken on a system with LVM layered atop a
    DIF/DIX SCSI drive because device mapper clones the bio, modifies the
    clone, and sends the clone to the lower layers for processing.
    However, the clone bio has bi_vcnt == 0, which means that when the sd
    driver calls bio_integrity_process to attach DIX data, the
    for_each_segment_all() call (which uses bi_vcnt) returns immediately
    and random garbage is sent to the disk on a disk write. The disk of
    course returns an error.

    Therefore, teach bio_integrity_process() to use bio_for_each_segment()
    to iterate the bio_vecs, since the per-bio iterator tracks which
    bio_vecs are associated with that particular bio. The integrity
    handling code is effectively part of the "driver" (it's not the bio
    owner), so it must use the correct iterator function.

    v2: Fix a compiler warning about abandoned local variables. This
    patch supersedes "block: bio_integrity_process uses wrong bio_vec
    iterator". Patch applies against 3.18-rc6.

    Signed-off-by: Darrick J. Wong
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Darrick J. Wong
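
    A sketch of the iterator switch, with the surrounding
    bio_integrity_process() bookkeeping elided:

      struct bio_vec bv;
      struct bvec_iter bviter;

      /*
       * for_each_segment_all() walks bi_io_vec[0..bi_vcnt), which is empty
       * for a clone. bio_for_each_segment() follows bi_iter instead and
       * covers exactly the range this bio owns, clone or not.
       */
      bio_for_each_segment(bv, bio, bviter) {
              void *kaddr = kmap_atomic(bv.bv_page);

              /* ... generate or verify protection information for bv ... */

              kunmap_atomic(kaddr);
      }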
     

05 Nov, 2014

1 commit

  • q->mq_usage_counter is a percpu_ref which is killed and drained when
    the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit()
    which involves freezing the queue is invoked on all existing queues.
    Because percpu_ref killing and draining involve a RCU grace period,
    doing the above on one queue after another may take a long time if
    there are many queues on the system.

    This patch splits out initiation of freezing and waiting for its
    completion, and updates blk_mq_queue_reinit_notify() so that the
    queues are frozen in parallel instead of one after another. Note that
    freezing and unfreezing are moved from blk_mq_queue_reinit() to
    blk_mq_queue_reinit_notify().

    Signed-off-by: Tejun Heo
    Reported-by: Christian Borntraeger
    Tested-by: Christian Borntraeger
    Signed-off-by: Jens Axboe

    Tejun Heo
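
    A sketch of the split, assuming start/wait helper names of the shape this
    change introduced (they may differ slightly from the final patch):

      /* Phase 1: initiate the percpu_ref kill on every queue. */
      list_for_each_entry(q, &all_q_list, all_q_node)
              blk_mq_freeze_queue_start(q);

      /*
       * Phase 2: wait for all of them afterwards, so the RCU grace periods
       * overlap instead of being paid once per queue.
       */
      list_for_each_entry(q, &all_q_list, all_q_node)
              blk_mq_freeze_queue_wait(q);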
     

31 Oct, 2014

1 commit

  • Priority of a merged request is computed by ioprio_best(). If one of the
    requests has an undefined priority (IOPRIO_CLASS_NONE) and the other
    request has a priority from IOPRIO_CLASS_BE, the function will return the
    undefined priority, which is wrong. Fix the function to properly return
    the priority of the request with the defined priority.

    Fixes: d58cdfb89ce0c6bd5f81ae931a984ef298dbda20
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jan Kara
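
    A sketch of the corrected selection logic using the standard ioprio
    macros; the exact fallback value is an assumption, not a quote of the
    patch.

      static unsigned short sketch_ioprio_best(unsigned short aprio,
                                               unsigned short bprio)
      {
              /*
               * Treat an undefined priority as the best-effort default so it
               * can never "win" over an explicitly set priority.
               */
              if (!ioprio_valid(aprio))
                      aprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);
              if (!ioprio_valid(bprio))
                      bprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM);

              return min(aprio, bprio);
      }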
     

24 Oct, 2014

1 commit

  • While compiling, the integer 'err' was flagged as a set-but-unused
    variable. elevator_init_fn can be either cfq_init_queue,
    deadline_init_queue or noop_init_queue, and all three of these functions
    return -ENOMEM if they fail to allocate the queue, so we should actually
    return that error code rather than always returning 0.

    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Jens Axboe

    Sudip Mukherjee
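
    The shape of the change, with the surrounding elevator_init() context
    simplified to two lines (the call shape is from memory of that era's
    elevator code):

      err = e->ops.elevator_init_fn(q, e);
      return err;             /* was: return 0, discarding a possible -ENOMEM */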
     

23 Oct, 2014

1 commit

  • When sg_scsi_ioctl() fails to prepare a request to submit in
    blk_rq_map_kern(), we jump to a label where we just end up copying the
    (luckily zeroed-out) kernel buffer to userspace instead of reporting the
    error. Fix the problem by jumping to the right label.

    CC: Jens Axboe
    CC: linux-scsi@vger.kernel.org
    CC: stable@vger.kernel.org
    Coverity-id: 1226871
    Signed-off-by: Jan Kara

    Fixed up the now-unused 'out' label.

    Signed-off-by: Jens Axboe

    Jan Kara
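
    A sketch of the control-flow fix; the label name and error value here are
    illustrative, not the exact sg_scsi_ioctl() code.

      if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, GFP_NOIO)) {
              err = -ENOMEM;
              goto error;     /* was: a label that still did the copy_to_user()
                                 of the (zeroed) kernel buffer */
      }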
     

22 Oct, 2014

1 commit

  • The problem was introduced by commit 764f612c6c3c231b ("blk-merge:
    don't compute bi_phys_segments from bi_vcnt for cloned bio"), and a
    merge is needed if the number of current segments isn't less than the
    maximum number of segments.

    Strictly speaking, bio->bi_vcnt shouldn't be used here since it may not
    be accurate for either a cloned bio or the bio it was cloned from, but
    bio_segments() is a bit expensive, and bi_vcnt is still an upper bound,
    so the approach should work.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per-hardware-context flush request, which both cleans up the
    code and should scale better for flush-intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

13 Oct, 2014

2 commits

  • In __get_request() calls to printk_ratelimited(), include the function
    name so the "callbacks suppressed" message matches the messages that are
    printed, and add "dev" before the device name so it matches other block
    layer messages.

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     
  • In blk_update_request, change the printk_ratelimited
    prefix from end_request to blk_update_request so it
    matches the name printed if rate limiting occurs.

    Old:
    [10234.933106] blk_update_request: 174 callbacks suppressed
    [10234.934940] end_request: critical target error, dev sdr, sector 16
    [10234.949788] end_request: critical target error, dev sdr, sector 16

    New:
    [16863.445173] blk_update_request: 398 callbacks suppressed
    [16863.447029] blk_update_request: critical target error, dev sdr, sector 1442066176
    [16863.449383] blk_update_request: critical target error, dev sdr, sector 802802888
    [16863.451680] blk_update_request: critical target error, dev sdr, sector 1609535456

    Signed-off-by: Robert Elliott
    Reviewed-by: Webb Scales
    Signed-off-by: Jens Axboe

    Robert Elliott
     

10 Oct, 2014

2 commits

  • Pull percpu updates from Tejun Heo:
    "A lot of activities on percpu front. Notable changes are...

    - percpu allocator now can take @gfp. If @gfp doesn't contain
    GFP_KERNEL, it tries to allocate from what's already available to
    the allocator and a work item tries to keep the reserve around
    certain level so that these atomic allocations usually succeed.

    This will replace the ad-hoc percpu memory pool used by
    blk-throttle and also be used by the planned blkcg support for
    writeback IOs.

    Please note that I noticed a bug in how @gfp is interpreted while
    preparing this pull request and applied the fix 6ae833c7fe0c
    ("percpu: fix how @gfp is interpreted by the percpu allocator")
    just now.

    - percpu_ref now uses longs for percpu and global counters instead of
    ints. It leads to more sparse packing of the percpu counters on
    64bit machines but the overhead should be negligible and this
    allows using percpu_ref for refcnting pages and in-memory objects
    directly.

    - The switching between percpu and single counter modes of a
    percpu_ref is made independent of putting the base ref and a
    percpu_ref can now optionally be initialized in single or killed
    mode. This allows avoiding percpu shutdown latency for cases where
    the refcounted objects may be synchronously created and destroyed
    in rapid succession with only a fraction of them reaching fully
    operational status (SCSI probing does this when combined with
    blk-mq support). It's also planned to be used to implement forced
    single mode to detect underflow more timely for debugging.

    There's a separate branch percpu/for-3.18-consistent-ops which cleans
    up the duplicate percpu accessors. That branch causes a number of
    conflicts with s390 and other trees. I'll send a separate pull
    request w/ resolutions once other branches are merged"

    * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
    percpu: fix how @gfp is interpreted by the percpu allocator
    blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
    percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
    percpu_ref: add PERCPU_REF_INIT_* flags
    percpu_ref: decouple switching to percpu mode and reinit
    percpu_ref: decouple switching to atomic mode and killing
    percpu_ref: add PCPU_REF_DEAD
    percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
    percpu_ref: replace pcpu_ prefix with percpu_
    percpu_ref: minor code and comment updates
    percpu_ref: relocate percpu_ref_reinit()
    Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
    Revert "percpu: free percpu allocation info for uniprocessor system"
    percpu-refcount: make percpu_ref based on longs instead of ints
    percpu-refcount: improve WARN messages
    percpu: fix locking regression in the failure path of pcpu_alloc()
    percpu-refcount: add @gfp to percpu_ref_init()
    proportions: add @gfp to init functions
    percpu_counter: add @gfp to percpu_counter_init()
    percpu_counter: make percpu_counters_lock irq-safe
    ...

    Linus Torvalds
     
  • It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt
    if the bio is cloned.

    Signed-off-by: Ming Lei
    Tested-by: Jeff Mahoney
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Oct, 2014

1 commit

  • The math in both blk_stack_limits() and queue_limit_alignment_offset()
    assumes that a block device's io_min (aka minimum_io_size) is always a
    power-of-2. Fix the math such that it works for a non-power-of-2 io_min.

    This issue (of alignment_offset != 0) became apparent when testing
    dm-thinp with a thinp blocksize that matches a RAID6 stripesize of
    1280K. Commit fdfb4c8c1 ("dm thin: set minimum_io_size to pool's data
    block size") unlocked the potential for alignment_offset != 0 due to
    the dm-thin-pool's io_min possibly being a non-power-of-2.

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
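
    A sketch of the arithmetic change; the function is a simplified stand-in
    for queue_limit_alignment_offset(), but sector_div() is the real helper
    (it divides its first argument in place and returns the remainder).

      static unsigned int sketch_alignment_offset(sector_t offset,
                                                  unsigned int io_min,
                                                  unsigned int alignment_offset)
      {
              /*
               * The old 'offset & (io_min - 1)' remainder is only correct
               * when io_min is a power of two; use a true modulo instead.
               */
              unsigned int rem = sector_div(offset, io_min);

              return (io_min + alignment_offset - rem) % io_min;
      }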
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

04 Oct, 2014

2 commits

  • Users of bio_clone_fast() do not want bios with their own bvecs.
    Allocating a bvec mempool as part of the bioset intended for such users
    is a waste of memory.

    bioset_create_nobvec() creates a bioset that doesn't have the bvec
    mempool.

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
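
    A usage sketch, assuming the 3.18-era signatures of these helpers:

      /*
       * A bioset for fast clones that borrow the parent's bvec table, so no
       * bvec mempool is allocated behind it.
       */
      struct bio_set *bs = bioset_create_nobvec(BIO_POOL_SIZE, 0);

      struct bio *clone = bio_clone_fast(bio, GFP_NOIO, bs);
      if (!clone)
              return -ENOMEM;
      /* clone->bi_io_vec points at the parent's vector; don't modify it. */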
     
  • Request cloning clones bios in the request to track the completion
    of each bio.
    For that purpose, we can use bio_clone_fast() instead of bio_clone()
    to avoid unnecessary allocation and copy of bvecs.

    This patch reduces memory footprint of request-based device-mapper
    (about 1-4KB for each request) and is a preparation for further
    reduction of memory usage by removing unused bvec mempool.

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

28 Sep, 2014

1 commit

  • The kernel used to contain two functions for length-delimited,
    case-insensitive string comparison, strnicmp with correct semantics
    and a slightly buggy strncasecmp. The latter is the POSIX name, so
    strnicmp was renamed to strncasecmp, and strnicmp made into a wrapper
    for the new strncasecmp to avoid breaking existing users.

    To allow the compat wrapper strnicmp to be removed at some point in
    the future, and to avoid the extra indirection cost, do
    s/strnicmp/strncasecmp/g.

    Cc: Jens Axboe
    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Jens Axboe

    Rasmus Villemoes
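
    The conversion is mechanical; a representative (hypothetical) call site
    before and after:

      /* before */
      if (strnicmp(buf, "write back", 10) == 0)
              wc = true;

      /* after: same semantics, POSIX name, no compat-wrapper indirection */
      if (strncasecmp(buf, "write back", 10) == 0)
              wc = true;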
     

27 Sep, 2014

11 commits

  • The T10 Protection Information format is also used by some devices that
    do not go through the SCSI layer (virtual block devices, NVMe). Relocate
    the relevant functions to a block layer library that can be used without
    involving SCSI.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • We'd occasionally merge requests with conflicting integrity flags.
    Introduce a merge helper which checks that the requests have compatible
    integrity payloads.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
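
    A sketch of the kind of compatibility check such a merge helper performs;
    the function name and the fields compared are simplified assumptions, not
    the exact blk-integrity helpers.

      static bool sketch_integrity_merge_ok(struct request *req, struct bio *bio)
      {
              struct bio_integrity_payload *bip = bio_integrity(bio);

              /* Both sides must agree on whether a payload is present... */
              if (!blk_integrity_rq(req) != !bip)
                      return false;

              /* ...and on the flags describing how it is handled. */
              if (bip && req->bio &&
                  bio_integrity(req->bio)->bip_flags != bip->bip_flags)
                      return false;

              return true;
      }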
     
  • Make the choice of checksum a per-I/O property by introducing a flag
    that can be inspected by the SCSI layer. There are several reasons for
    this:

    1. It allows us to switch choice of checksum without unloading and
    reloading the HBA driver.

    2. During error recovery we need to be able to tell the HBA that
    checksums read from disk should not be verified and converted to IP
    checksums.

    3. For error injection purposes we need to be able to write a bad guard
    tag to storage. Since the storage device only supports T10 CRC we
    need to be able to disable IP checksum conversion on the HBA.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Move flags affecting the integrity code out of the bio bi_flags and into
    the block integrity payload.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • So far we have relied on the app tag size to determine whether a disk
    has been formatted with T10 protection information or not. However, not
    all target devices provide application tag storage.

    Add a flag to the block integrity profile that indicates whether the
    disk has been formatted with protection information.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Add a BLK_ prefix to the integrity profile flags. Also rename the flags
    to be more consistent with the generate/verify terminology in the rest
    of the integrity code.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Instead of the "operate" parameter we pass in a seed value and a pointer
    to a function that can be used to process the integrity metadata. The
    generation function is changed to have a return value to fit into this
    scheme.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Now that the protection interval has been detached from the sector size
    we need to be able to handle sizes that are different from 4K and
    512. Make the interval calculation generic.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The protection interval is not necessarily tied to the logical block
    size of a block device. Stop using the terms "sector" and "sectors".

    Going forward we will use the term "seed" to describe the initial
    reference tag value for a given I/O. "Interval" will be used to describe
    the portion of the data buffer that a given piece of protection
    information is associated with.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • bip_buf is not really needed so we can remove it.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • None of the filesystems appear interested in using the integrity tagging
    feature, potentially because very few storage devices actually permit
    using the application tag space.

    Remove the tagging functions.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen