Eric Lee / smarc-fsl-linux-kernel

06 Dec, 2013

2 commits

5ee540613 Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull block layer fixes from Jens Axboe:
"A small collection of fixes for the current series. It contains:

- A fix for a use-after-free of a request in blk-mq. From Ming Lei

- A fix for a blk-mq bug that could attempt to dereference a NULL rq
if allocation failed

- Two xen-blkfront small fixes

- Cleanup of submit_bio_wait() type uses in the kernel, unifying
that. From Kent

- A fix for 32-bit blkg_rwstat reading. I apologize for this one
looking mangled in the shortlog, it's entirely my fault for missing
an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: fix use-after-free of request
blk-mq: fix dereference of rq->mq_ctx if allocation fails
block: xen-blkfront: Fix possible NULL ptr dereference
xen-blkfront: Silence pfn maybe-uninitialized warning
block: submit_bio_wait() conversions
Update of blkg_stat and blkg_rwstat may happen in bh context

Linus Torvalds
2013-12-06 07:33:27 +0800
0d11e6aca blk-mq: fix use-after-free of request

If accounting is on, we will do the IO completion accounting after
we have freed the request. Fix that by doing the accounting before
the request is freed.

Signed-off-by: Jens Axboe

Ming Lei
2013-12-06 01:50:39 +0800

04 Dec, 2013

1 commit

959a35f13 blk-mq: fix dereference of rq->mq_ctx if allocation fails

If __GFP_WAIT isn't set and we fail allocating, when we go
to drop the reference on the ctx, we will attempt to dereference
the NULL rq. Fix that.

Signed-off-by: Jeff Moyer
Signed-off-by: Jens Axboe

Jeff Moyer
2013-12-04 05:24:28 +0800

25 Nov, 2013

1 commit

c170bbb45 block: submit_bio_wait() conversions

It was being open coded in a few places.

Signed-off-by: Kent Overstreet
Cc: Jens Axboe
Cc: Joern Engel
Cc: Prasad Joshi
Cc: Neil Brown
Cc: Chris Mason
Acked-by: NeilBrown
Signed-off-by: Jens Axboe

Kent Overstreet
2013-11-25 07:33:41 +0800

22 Nov, 2013

1 commit

49204c116 block/partitions/efi.c: fix bound check

Use ARRAY_SIZE instead of sizeof to get proper max for label length.

Since this is just a read out of bounds it's not that bad, but the
problem becomes user-visible eg if one tries to use DEBUG_PAGEALLOC and
DEBUG_RODATA, at least with some enhancements from Hiroshi. Of course
the destination array can contain garbage when we read beyond the end of
source array so that would be another user-visible problem.
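
The difference between the two bounds can be sketched in user space. This is a minimal illustration, not the efi.c code: struct demo_entry and copy_label() are hypothetical names, and the 36-element 16-bit name field is an assumption chosen for the demo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same idea as the kernel's ARRAY_SIZE(): element count, not byte count. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical entry whose name uses 16 bits per character. */
struct demo_entry {
    uint16_t name[36];
};

/* Copy the low byte of each character, never reading past the array end. */
static size_t copy_label(const struct demo_entry *e, char *dst, size_t dst_len)
{
    /* 36 elements here; sizeof(e->name) would give 72 and overrun the source. */
    size_t max = ARRAY_SIZE(e->name);
    size_t i;

    assert(dst_len > 0);
    if (max > dst_len - 1)
        max = dst_len - 1;      /* leave room for the terminator */
    for (i = 0; i < max; i++)
        dst[i] = (char)(e->name[i] & 0xff);
    dst[i] = '\0';
    return i;
}
```

With sizeof as the bound, the loop would index name[36] through name[71], which is the out-of-bounds read described above.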

Signed-off-by: Antti P Miettinen
Reviewed-by: Hiroshi Doyu
Tested-by: Hiroshi Doyu
Cc: Will Drewry
Cc: Matt Fleming
Acked-by: Davidlohr Bueso
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Antti P Miettinen
2013-11-22 08:42:27 +0800

21 Nov, 2013

1 commit

2c575026f Update of blkg_stat and blkg_rwstat may happen in bh context.

u64_stats_fetch_retry() only disables preemption on 32-bit UP
systems. That is not enough to prevent a bh from interrupting the
reader, which may then read an inconsistent 64-bit value.

Signed-off-by: Hong Zhiguo
Acked-by: Tejun Heo
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Hong Zhiguo
2013-11-21 06:33:04 +0800

20 Nov, 2013

2 commits

01b983c9f blk-mq: add blktrace insert event trace

We need it to make 'btt' from blktrace happy, otherwise
we are missing one state transition.

Signed-off-by: Jens Axboe

Jens Axboe
2013-11-20 10:00:45 +0800
94eddfbea blk-mq: ensure that we set REQ_IO_STAT so diskstats work

If disk stats are enabled on the queue, a request needs to
be marked with REQ_IO_STAT for accounting to be active on
that request. This fixes an issue with virtio-blk not
showing up in /proc/diskstats after the conversion to
blk-mq.

Add QUEUE_FLAG_MQ_DEFAULT, setting stats and same cpu-group
completion on by default.

Reported-by: Dave Chinner
Signed-off-by: Jens Axboe

Jens Axboe
2013-11-20 00:25:07 +0800

16 Nov, 2013

1 commit

f412f2c60 Merge branch 'for-linus' of git://git.kernel.dk/linux-block

Pull second round of block driver updates from Jens Axboe:
"As mentioned in the original pull request, the bcache bits were pulled
because of their dependency on the immutable bio vecs. Kent re-did
this part and resubmitted it, so here's the 2nd round of (mostly)
driver updates for 3.13. It contains:

- The bcache work from Kent.

- Conversion of virtio-blk to blk-mq. This removes the bio and request
path, and substitutes the blk-mq path instead. The end result is
almost 200 deleted lines. Patch is acked by Asias and Christoph, who
both did a bunch of testing.

- A removal of bootmem.h include from Grygorii Strashko, part of a
larger series of his killing the dependency on that header file.

- Removal of __cpuinit from blk-mq from Paul Gortmaker"

* 'for-linus' of git://git.kernel.dk/linux-block: (56 commits)
virtio_blk: blk-mq support
blk-mq: remove newly added instances of __cpuinit
bcache: defensively handle format strings
bcache: Bypass torture test
bcache: Delete some slower inline asm
bcache: Use ida for bcache block dev minor
bcache: Fix sysfs splat on shutdown with flash only devs
bcache: Better full stripe scanning
bcache: Have btree_split() insert into parent directly
bcache: Move spinlock into struct time_stats
bcache: Kill sequential_merge option
bcache: Kill bch_next_recurse_key()
bcache: Avoid deadlocking in garbage collection
bcache: Incremental gc
bcache: Add make_btree_freeing_key()
bcache: Add btree_node_write_sync()
bcache: PRECEDING_KEY()
bcache: bch_(btree|extent)_ptr_invalid()
bcache: Don't bother with bucket refcount for btree node allocations
bcache: Debug code improvements
...

Linus Torvalds
2013-11-16 08:33:41 +0800

15 Nov, 2013

1 commit

0a06ff068 kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS

We've switched over every architecture that supports SMP to it, so
remove the now useless config variable.

Signed-off-by: Christoph Hellwig
Cc: Jan Kara
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2013-11-15 08:32:22 +0800

14 Nov, 2013

4 commits

f618ef7c4 blk-mq: remove newly added instances of __cpuinit

The new blk-mq code added new instances of __cpuinit usage.
We removed this a couple versions ago; we now want to remove
the compat no-op stubs. Introducing new users is not what
we want to see at this point in time, as it will break once
the stubs are gone.

Signed-off-by: Paul Gortmaker
Signed-off-by: Jens Axboe

Paul Gortmaker
2013-11-14 23:26:02 +0800
5e30025a3 Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core locking changes from Ingo Molnar:
"The biggest changes:

- add lockdep support for seqcount/seqlocks structures, this
unearthed both bugs and required extra annotation.

- move the various kernel locking primitives to the new
kernel/locking/ directory"

* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
block: Use u64_stats_init() to initialize seqcounts
locking/lockdep: Mark __lockdep_count_forward_deps() as static
lockdep/proc: Fix lock-time avg computation
locking/doc: Update references to kernel/mutex.c
ipv6: Fix possible ipv6 seqlock deadlock
cpuset: Fix potential deadlock w/ set_mems_allowed
seqcount: Add lockdep functionality to seqcount/seqlock structures
net: Explicitly initialize u64_stats_sync structures for lockdep
locking: Move the percpu-rwsem code to kernel/locking/
locking: Move the lglocks code to kernel/locking/
locking: Move the rwsem code to kernel/locking/
locking: Move the rtmutex code to kernel/locking/
locking: Move the semaphore core to kernel/locking/
locking: Move the spinlock code to kernel/locking/
locking: Move the lockdep code to kernel/locking/
locking: Move the mutex code to kernel/locking/
hung_task debugging: Add tracepoint to report the hang
x86/locking/kconfig: Update paravirt spinlock Kconfig description
lockstat: Report avg wait and hold times
lockdep, x86/alternatives: Drop ancient lockdep fixup message
...

Linus Torvalds
2013-11-14 15:30:30 +0800
0910c0bdf Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block

Pull block IO core updates from Jens Axboe:
"This is the pull request for the core changes in the block layer for
3.13. It contains:

- The new blk-mq request interface.

This is a new and more scalable queueing model that marries the
best part of the request based interface we currently have (which
is fully featured, but scales poorly) and the bio based "interface"
which the new drivers for high IOPS devices end up using because
it's much faster than the request based one.

The bio interface has no block layer support, since it taps into
the stack much earlier. This means that drivers end up having to
implement a lot of functionality on their own, like tagging,
timeout handling, requeue, etc. The blk-mq interface provides all
these. Some drivers even provide a switch to select bio or rq and
have code to handle both, since things like merging only work in
the rq model, which is hence faster for some workloads. This is a
huge mess. Conversion of these drivers nets us a substantial code
reduction. Initial results on converting SCSI to this model even
show an 8x improvement on single queue devices. So while the
model was intended to work on the newer multiqueue devices, it has
substantial improvements for "classic" hardware as well. This code
has gone through extensive testing and development, it's now ready
to go. A pull request to convert virtio-blk to this model will be
coming as well, with more drivers scheduled for 3.14 conversion.

- Two blktrace fixes from Jan and Chen Gang.

- A plug merge fix from Alireza Haghdoost.

- Conversion of __get_cpu_var() from Christoph Lameter.

- Fix for sector_div() with 64-bit divider from Geert Uytterhoeven.

- A fix for a race between request completion and the timeout
handling from Jeff Moyer. This is what caused the merge conflict
with blk-mq/core, in case you are looking at that.

- A dm stacking fix from Mike Snitzer.

- A code consolidation fix and duplicated code removal from Kent
Overstreet.

- A handful of block bug fixes from Mikulas Patocka, fixing a loop
crash and memory corruption on blk cg.

- Elevator switch bug fix from Tomoki Sekiyama.

A heads-up that I had to rebase this branch. Initially the immutable
bio_vecs had been queued up for inclusion, but a week later, it became
clear that it wasn't fully cooked yet. So the decision was made to
pull this out and postpone it until 3.14. It was a straightforward
rebase, just pruning out the immutable series and the later fixes of
problems with it. The rest of the patches applied directly and no
further changes were made"

* 'for-3.13/core' of git://git.kernel.dk/linux-block: (31 commits)
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
block: Do not call sector_div() with a 64-bit divisor
kernel: trace: blktrace: remove redundent memcpy() in compat_blk_trace_setup()
block: Consolidate duplicated bio_trim() implementations
block: Use rw_copy_check_uvector()
block: Enable sysfs nomerge control for I/O requests in the plug list
block: properly stack underlying max_segment_size to DM device
elevator: acquire q->sysfs_lock in elevator_change()
elevator: Fix a race in elevator switching and md device initialization
block: Replace __get_cpu_var uses
bdi: test bdi_init failure
block: fix a probe argument to blk_register_region
loop: fix crash if blk_alloc_queue fails
blk-core: Fix memory corruption if blkcg_init_queue fails
block: fix race between request completion and timeout handling
blktrace: Send BLK_TN_PROCESS events to all running traces
blk-mq: don't disallow request merges for req->special being set
blk-mq: mq plug list breakage
blk-mq: fix for flush deadlock
...

Linus Torvalds
2013-11-14 11:08:14 +0800
8ceafbfa9 Merge branch 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm

Pull DMA mask updates from Russell King:
"This series cleans up the handling of DMA masks in a lot of drivers,
fixing some bugs as we go.

Some of the more serious errors include:
- drivers which only set their coherent DMA mask if the attempt to
set the streaming mask fails.
- drivers which test for a NULL dma mask pointer, and then set the
dma mask pointer to a location in their module .data section -
which will cause problems if the module is reloaded.

To counter these, I have introduced two helper functions:
- dma_set_mask_and_coherent() takes care of setting both the
streaming and coherent masks at the same time, with the correct
error handling as specified by the API.
- dma_coerce_mask_and_coherent() which resolves the problem of
drivers forcefully setting DMA masks. This is more a marker for
future work to further clean these locations up - the code which
creates the devices really should be initialising these, but to fix
that in one go along with this change could potentially be very
disruptive.

The last thing this series does is prise away some of Linux's addiction
to "DMA addresses are physical addresses and RAM always starts at
zero". We have ARM LPAE systems where all system memory is above 4GB
physical, hence having DMA masks interpreted by (eg) the block layers
as describing physical addresses in the range 0..DMAMASK fails on
these platforms. Santosh Shilimkar addresses this in this series; the
patches were copied to the appropriate people multiple times but were
ignored.

Fixing this also gets rid of some ARM weirdness in the setup of the
max*pfn variables, and brings ARM into line with every other Linux
architecture as far as those go"

* 'for-linus-dma-masks' of git://git.linaro.org/people/rmk/linux-arm: (52 commits)
ARM: 7805/1: mm: change max*pfn to include the physical offset of memory
ARM: 7797/1: mmc: Use dma_max_pfn(dev) helper for bounce_limit calculations
ARM: 7796/1: scsi: Use dma_max_pfn(dev) helper for bounce_limit calculations
ARM: 7795/1: mm: dma-mapping: Add dma_max_pfn(dev) helper function
ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit()
ARM: DMA-API: better handing of DMA masks for coherent allocations
ARM: 7857/1: dma: imx-sdma: setup dma mask
DMA-API: firmware/google/gsmi.c: avoid direct access to DMA masks
DMA-API: dcdbas: update DMA mask handing
DMA-API: dma: edma.c: no need to explicitly initialize DMA masks
DMA-API: usb: musb: use platform_device_register_full() to avoid directly messing with dma masks
DMA-API: crypto: remove last references to 'static struct device *dev'
DMA-API: crypto: fix ixp4xx crypto platform device support
DMA-API: others: use dma_set_coherent_mask()
DMA-API: staging: use dma_set_coherent_mask()
DMA-API: usb: use new dma_coerce_mask_and_coherent()
DMA-API: usb: use dma_set_coherent_mask()
DMA-API: parport: parport_pc.c: use dma_coerce_mask_and_coherent()
DMA-API: net: octeon: use dma_coerce_mask_and_coherent()
DMA-API: net: nxp/lpc_eth: use dma_coerce_mask_and_coherent()
...

Linus Torvalds
2013-11-14 06:55:21 +0800

13 Nov, 2013

1 commit

90d3839b9 block: Use u64_stats_init() to initialize seqcounts

Now that seqcounts are lockdep enabled objects, we need to explicitly
initialize runtime allocated seqcounts so that lockdep can track them.

Without this patch, Fengguang was seeing:

[ 4.127282] INFO: trying to register non-static key.
[ 4.128027] the code is fine but needs lockdep annotation.
[ 4.128027] turning off the locking correctness validator.
[ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
[ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ ... ]
[ 4.128027] Call Trace:
[ 4.128027] [] ? console_unlock+0x353/0x380
[ 4.128027] [] dump_stack+0x48/0x60
[ 4.128027] [] __lock_acquire.isra.26+0x7e3/0xceb
[ 4.128027] [] lock_acquire+0x71/0x9a
[ 4.128027] [] ? blk_throtl_bio+0x1c3/0x485
[ 4.128027] [] throtl_update_dispatch_stats+0x7c/0x153
[ 4.128027] [] ? blk_throtl_bio+0x1c3/0x485
[ 4.128027] [] blk_throtl_bio+0x1c3/0x485
...

Use u64_stats_init() for all affected data structures, which initializes
the seqcount.

Reported-and-Tested-by: Fengguang Wu
Cc: Vivek Goyal
Cc: Jens Axboe
Signed-off-by: Peter Zijlstra
[ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
Signed-off-by: John Stultz
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
[ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
Signed-off-by: Ingo Molnar

Peter Zijlstra
2013-11-13 20:54:08 +0800

09 Nov, 2013

10 commits

d17ab4592 block: cleanup removing dependency on bootmem headers

Cc: Yinghai Lu
Cc: Tejun Heo
Cc: Andrew Morton

Signed-off-by: Grygorii Strashko
Signed-off-by: Santosh Shilimkar
Signed-off-by: Jens Axboe

Grygorii Strashko
2013-11-09 10:43:48 +0800
e37459b8e Merge branch 'blk-mq/core' into for-3.13/core

Signed-off-by: Jens Axboe

Conflicts:
block/blk-timeout.c

Jens Axboe
2013-11-09 00:08:12 +0800
c7d1ba417 block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO

This patch fixes coccinelle error regarding usage of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.
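
For readers unfamiliar with the pattern, here is a user-space sketch of what the replacement amounts to. The lower-case helpers are stand-ins modelled on the kernel's err.h, written for this illustration; they are assumptions, not the kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Error codes are encoded as pointers in the top MAX_ERRNO addresses. */
static inline int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* Open-coded form that the coccinelle check flags. */
static long open_coded(const void *ptr)
{
    if (is_err(ptr))
        return ptr_err(ptr);
    return 0;
}

/* Equivalent one-liner, the shape of PTR_ERR_OR_ZERO(). */
static long ptr_err_or_zero(const void *ptr)
{
    return is_err(ptr) ? ptr_err(ptr) : 0;
}

/* Both forms must agree for any pointer value. */
static void check_same(const void *ptr)
{
    assert(open_coded(ptr) == ptr_err_or_zero(ptr));
}
```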

Signed-off-by: Duan Jiong
Signed-off-by: Jens Axboe

Duan Jiong
2013-11-09 00:05:31 +0800
8616ebb16 block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO

This patch fixes coccinelle error regarding usage of IS_ERR and
PTR_ERR instead of PTR_ERR_OR_ZERO.

Signed-off-by: Duan Jiong
Signed-off-by: Jens Axboe

Duan Jiong
2013-11-09 00:05:30 +0800
97597dc08 block: Do not call sector_div() with a 64-bit divisor

do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
of 64-bit numbers by 32-bit numbers. Passing 64-bit divisor types caused
issues in the past on 32-bit platforms, cfr. commit
ea077b1b96e073eac5c3c5590529e964767fc5f7 ("m68k: Truncate base in
do_div()").

As queue_limits.max_discard_sectors and .discard_granularity are unsigned
int, max_discard_sectors and granularity should be unsigned int.
As bdev_discard_alignment() returns int, alignment should be int.
Now 2 calls to sector_div() can be replaced by 32-bit arithmetic:
- The 64-bit modulo operation can become a 32-bit modulo operation,
- The 64-bit division and multiplication can be replaced by a 32-bit
modulo operation and a subtraction.
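
The second replacement listed above can be sketched as follows. This is an illustration with hypothetical function names, not the patched kernel code: it only shows that rounding down with a divide and multiply in a wide type gives the same result as one 32-bit modulo and a subtraction.

```c
#include <assert.h>
#include <stdint.h>

/* Old shape: widen to 64 bits, divide, then multiply back. */
static uint32_t round_down_wide(uint32_t value, uint32_t granularity)
{
    uint64_t tmp = value;

    assert(granularity != 0);
    tmp /= granularity;                 /* stands in for sector_div() */
    return (uint32_t)(tmp * granularity);
}

/* New shape: one 32-bit modulo and one subtraction. */
static uint32_t round_down_narrow(uint32_t value, uint32_t granularity)
{
    assert(granularity != 0);
    return value - value % granularity;
}
```

Because both operands already fit in unsigned int, the narrow form needs no 64-bit division helper on 32-bit platforms.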

Signed-off-by: Geert Uytterhoeven
Signed-off-by: Jens Axboe

Geert Uytterhoeven
2013-11-09 00:04:46 +0800
e0ce0eacb block: Use rw_copy_check_uvector()

No need for silly open coding - and struct sg_iovec has exactly the same
layout as struct iovec...

Signed-off-by: Kent Overstreet
Cc: Jens Axboe
Signed-off-by: Jens Axboe

Kent Overstreet
2013-11-09 00:02:14 +0800
23779fbc9 block: Enable sysfs nomerge control for I/O requests in the plug list

This patch enables sysfs to control I/O request merging in the plug
list. While this control has been implemented for the request queue,
it was omitted for the plug list. Therefore, the block layer merges
requests (or attempts to merge them) even if merging was disabled by
setting the sysfs nomerge parameter to 2.

This limitation directly affects the io_submit() system call. The
system call lets a user submit a batch of IO requests from user space
through the struct iocb **ios input argument. However, the
unconditional merging in the plug list may merge these requests
together down the road. Therefore, there is no way to distinguish
between an application sending a batch of sequential IOs and an
application sending one big IO. Ultimately, all requests generated by
the former application are merged within the plug list and look like
those of the latter.

While merging is a desirable feature that improves the performance of
the IO subsystem for some applications, it is not useful at all for
other applications like ours.

Signed-off-by: Alireza Haghdoost
Reviewed-by: Jeff Moyer

Coding style modified.

Signed-off-by: Jens Axboe

Alireza Haghdoost
2013-11-09 00:00:22 +0800
d82ae52e6 block: properly stack underlying max_segment_size to DM device

Without this patch all DM devices will default to BLK_MAX_SEGMENT_SIZE
(65536) even if the underlying device(s) have a larger value -- this is
due to blk_stack_limits() using min_not_zero() when stacking the
max_segment_size limit.

1073741824

before patch:
65536

after patch:
1073741824
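
The effect of min_not_zero() on a stacked limit can be sketched in user space. The helper names are hypothetical, and starting the stacked device from UINT_MAX is an assumption used to illustrate one way out of the problem, not a description of the patch itself.

```c
#include <assert.h>
#include <limits.h>

/* Default quoted in the commit message (BLK_MAX_SEGMENT_SIZE). */
#define DEMO_MAX_SEGMENT_SIZE 65536u

/* Smaller of two limits, treating 0 as "not set" (mirrors min_not_zero()). */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/* Stack a bottom device's limit onto a top device that starts at `initial`. */
static unsigned int stack_limit(unsigned int initial, unsigned int bottom)
{
    assert(bottom != 0);
    return min_not_zero_u(initial, bottom);
}
```

Starting from DEMO_MAX_SEGMENT_SIZE, a bottom device reporting 1073741824 still stacks to 65536, because the minimum can never grow. Starting from UINT_MAX, the same call yields 1073741824.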

Reported-by: Lukasz Flis
Signed-off-by: Mike Snitzer
Cc: stable@vger.kernel.org # v3.3+
Signed-off-by: Jens Axboe

Mike Snitzer
2013-11-09 00:00:17 +0800
7c8a3679e elevator: acquire q->sysfs_lock in elevator_change()

Add locking of q->sysfs_lock into elevator_change() (an exported function)
to ensure it is held to protect q->elevator from elevator_init(), even if
elevator_change() is called from non-sysfs paths.
sysfs path (elv_iosched_store) uses __elevator_change(), non-locking
version, as the lock is already taken by elv_iosched_store().
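
The split between a locking entry point and a non-locking inner function can be sketched like this. It is a single-threaded model with hypothetical names: the `locked` flag only stands in for q->sysfs_lock so that the demo can assert who holds it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical queue; `locked` models q->sysfs_lock. */
struct demo_queue {
    bool locked;
    char elevator[16];
};

static void demo_lock(struct demo_queue *q)
{
    assert(!q->locked);
    q->locked = true;
}

static void demo_unlock(struct demo_queue *q)
{
    assert(q->locked);
    q->locked = false;
}

/* Non-locking version: the caller must already hold the lock. */
static int demo_elevator_change_locked(struct demo_queue *q, const char *name)
{
    assert(q->locked);
    if (strlen(name) >= sizeof(q->elevator))
        return -1;
    strcpy(q->elevator, name);
    return 0;
}

/* Exported-style version: takes the lock itself, for non-sysfs callers. */
static int demo_elevator_change(struct demo_queue *q, const char *name)
{
    int ret;

    demo_lock(q);
    ret = demo_elevator_change_locked(q, name);
    demo_unlock(q);
    return ret;
}
```

A path that already holds the lock calls the inner function directly, which avoids taking the same lock twice.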

Signed-off-by: Tomoki Sekiyama
Signed-off-by: Jens Axboe

Tomoki Sekiyama
2013-11-09 00:00:13 +0800
eb1c160b2 elevator: Fix a race in elevator switching and md device initialization

The soft lockup below happens at the boot time of the system using dm
multipath and the udev rules to switch scheduler.

[ 356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
[ 356.127001] RIP: 0010:[] [] lock_timer_base.isra.35+0x1d/0x50
...
[ 356.127001] Call Trace:
[ 356.127001] [] try_to_del_timer_sync+0x20/0x70
[ 356.127001] [] ? kmem_cache_alloc_node_trace+0x20a/0x230
[ 356.127001] [] del_timer_sync+0x52/0x60
[ 356.127001] [] cfq_exit_queue+0x32/0xf0
[ 356.127001] [] elevator_exit+0x2f/0x50
[ 356.127001] [] elevator_change+0xf1/0x1c0
[ 356.127001] [] elv_iosched_store+0x20/0x50
[ 356.127001] [] queue_attr_store+0x59/0xb0
[ 356.127001] [] sysfs_write_file+0xc6/0x140
[ 356.127001] [] vfs_write+0xbd/0x1e0
[ 356.127001] [] SyS_write+0x49/0xa0
[ 356.127001] [] system_call_fastpath+0x16/0x1b

This is caused by a race between md device initialization by multipathd and
a shell script that switches the scheduler through sysfs.

- multipathd:
SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
-> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
q->elevator = elevator_alloc(q, e); // not yet initialized

- sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
elevator_switch (in the call trace above)
struct elevator_queue *old = q->elevator;
q->elevator = elevator_alloc(q, new_e);
elevator_exit(old); // lockup! (*)

- multipathd: (cont.)
err = e->ops.elevator_init_fn(q); // init fails; q->elevator is modified

(*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
while timer->base == NULL. In this case, as the timer will never be
initialized, it results in a lockup.

This patch introduces acquisition of q->sysfs_lock around elevator_init()
into blk_init_allocated_queue(), to provide mutual exclusion between
initialization of the q->scheduler and switching of the scheduler.

This should fix this bugzilla:
https://bugzilla.redhat.com/show_bug.cgi?id=902012

Signed-off-by: Tomoki Sekiyama
Signed-off-by: Jens Axboe

Tomoki Sekiyama
2013-11-09 00:00:08 +0800

08 Nov, 2013

3 commits

170d800af block: Replace __get_cpu_var uses

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x). This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area. __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset. Thereby address calculations are avoided and fewer registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non-x86
arches as well in order to optimize per cpu access by f.e. using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y);

Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y) = x;

Converts to

this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++;

Converts to

this_cpu_inc(y);

Signed-off-by: Christoph Lameter
Signed-off-by: Jens Axboe

Christoph Lameter
2013-11-08 23:59:58 +0800
fff4996b7 blk-core: Fix memory corruption if blkcg_init_queue fails

If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
to clean up structures allocated by the backing dev.

------------[ cut here ]------------
WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
ODEBUG: free active (active state 0) object type: percpu_counter hint: (null)
Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
CPU: 0 PID: 2739 Comm: lvchange Tainted: G W
3.10.15-devel #14
Hardware name: empty empty/S3992-E, BIOS 'V1.06 ' 06/09/2009
0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
Call Trace:
[] dump_stack+0x19/0x1b
[] warn_slowpath_common+0x6b/0xa0
[] warn_slowpath_fmt+0x47/0x50
[] ? debug_check_no_obj_freed+0xcf/0x250
[] debug_print_object+0x85/0xa0
[] debug_check_no_obj_freed+0x203/0x250
[] kmem_cache_free+0x20c/0x3a0
[] blk_alloc_queue_node+0x2a9/0x2c0
[] blk_alloc_queue+0xe/0x10
[] dm_create+0x1a3/0x530 [dm_mod]
[] ? list_version_get_info+0xe0/0xe0 [dm_mod]
[] dev_create+0x57/0x2b0 [dm_mod]
[] ? list_version_get_info+0xe0/0xe0 [dm_mod]
[] ? list_version_get_info+0xe0/0xe0 [dm_mod]
[] ctl_ioctl+0x268/0x500 [dm_mod]
[] ? get_lock_stats+0x22/0x70
[] dm_ctl_ioctl+0xe/0x20 [dm_mod]
[] do_vfs_ioctl+0x2ed/0x520
[] ? fget_light+0x377/0x4e0
[] SyS_ioctl+0x4b/0x90
[] system_call_fastpath+0x1a/0x1f
---[ end trace 4b5ff0d55673d986 ]---
------------[ cut here ]------------

This fix should be backported to stable kernels starting with 2.6.37. Note
that in the kernels prior to 3.5 the affected code is different, but the
bug is still there - bdi_init is called and bdi_destroy isn't.
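
The class of bug, an init call with no matching destroy on a later failure, can be sketched in user space. All names below are hypothetical stand-ins, and the counter exists only so the demo can observe whether init and destroy stay paired.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of initialized-but-not-destroyed backing structures. */
static int demo_bdi_live;

static int demo_bdi_init(void)
{
    demo_bdi_live++;
    return 0;
}

static void demo_bdi_destroy(void)
{
    assert(demo_bdi_live > 0);
    demo_bdi_live--;
}

/* Stand-in for a later setup step that can be forced to fail. */
static int demo_cg_init(int fail)
{
    return fail ? -1 : 0;
}

/* Allocate a queue; every step that succeeded is undone on failure. */
static void *demo_alloc_queue(int fail_cg)
{
    void *q = malloc(64);

    if (!q)
        return NULL;
    if (demo_bdi_init())
        goto fail_q;
    if (demo_cg_init(fail_cg))
        goto fail_bdi;
    return q;

fail_bdi:
    demo_bdi_destroy();     /* the cleanup step that was missing */
fail_q:
    free(q);
    return NULL;
}
```

Without the fail_bdi label, the memory would be freed while the backing structure is still active, which is the situation the debugobjects warning above reports.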

Signed-off-by: Mikulas Patocka
Acked-by: Tejun Heo
Cc: stable@kernel.org # 2.6.37+
Signed-off-by: Jens Axboe

Mikulas Patocka
2013-11-08 23:59:17 +0800
4912aa6c1 block: fix race between request completion and timeout handling

crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 491, comm: scsi_eh_0 Tainted: G W ---------------- 2.6.32-220.13.1.el6.x86_64 #1 IBM -[8722PAX]-/00D1461
RIP: 0010:[] [] blk_requeue_request+0x94/0xa0
RSP: 0018:ffff881057eefd60 EFLAGS: 00010012
RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
FS: 0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
Stack:
0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
Call Trace:
[] __scsi_queue_insert+0xa3/0x150
[] ? scsi_eh_ready_devs+0x5e3/0x850
[] scsi_queue_insert+0x13/0x20
[] scsi_eh_flush_done_q+0x104/0x160
[] scsi_error_handler+0x35b/0x660
[] ? scsi_error_handler+0x0/0x660
[] kthread+0x96/0xa0
[] child_rip+0xa/0x20
[] ? kthread+0x0/0xa0
[] ? child_rip+0x0/0x20
Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
RIP [] blk_requeue_request+0x94/0xa0
RSP

The RIP is this line:
BUG_ON(blk_queued_rq(rq));

After digging through the code, I think there may be a race between the
request completion and the timer handler running.

A timer is started for each request put on the device's queue (see
blk_start_request->blk_add_timer). If the request does not complete
before the timer expires, the timer handler (blk_rq_timed_out_timer)
will mark the request complete atomically:

static inline int blk_mark_rq_complete(struct request *rq)
{
return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}

and then call blk_rq_timed_out. The latter function will call
scsi_times_out, which will return one of BLK_EH_HANDLED,
BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED. If BLK_EH_RESET_TIMER is
returned, blk_clear_rq_complete is called, and blk_add_timer is again
called to simply wait longer for the request to complete.

Now, if the request happens to complete while this is going on, what
happens? Given that we know the completion handler will bail if it
finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
handler running after that bit is cleared. So, from the above
paragraph, after the call to blk_clear_rq_complete. If the completion
sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
there (I haven't seen this in the cores). Next, if we get the
completion before the call to list_add_tail, then the timer will
eventually fire for an old req, which may either be freed or reallocated
(there is evidence that this might be the case). Finally, if the
completion comes in *after* the addition to the timeout list, I think
it's harmless. The request will be removed from the timeout list,
REQ_ATOM_COMPLETE will be set, and all will be well.

This will only actually explain the coredumps *IF* the request
structure was freed, reallocated *and* queued before the error handler
thread had a chance to process it. That is possible, but it may make
sense to keep digging for another race. I think that if this is what
was happening, we would see other instances of this problem showing up
as null pointer or garbage pointer dereferences, for example when the
request structure was not re-used. It looks like we actually do run
into that situation in other reports.

This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
&req->atomic_flags)); from blk_add_timer to the only caller that could
trip over it (blk_start_request). It then inverts the calls to
blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
the race. I've boot tested this patch, but nothing more.
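
The reordering can be modelled in userspace as a rough sketch. Everything below is an assumption made for illustration: struct fake_request, the on_timeout_list field and the helper names are hypothetical stand-ins, C11 atomics replace the kernel bit operations, and the model is single-threaded, so it shows the intended ordering rather than reproducing the race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct request; only what the sketch needs. */
struct fake_request {
	atomic_bool complete;   /* plays the role of REQ_ATOM_COMPLETE */
	bool on_timeout_list;   /* true while a timer is armed for the request */
};

/* Returns the previous value: true means somebody else already owns rq. */
static bool mark_rq_complete(struct fake_request *rq)
{
	return atomic_exchange(&rq->complete, true);
}

static void clear_rq_complete(struct fake_request *rq)
{
	atomic_store(&rq->complete, false);
}

static void add_timer(struct fake_request *rq)
{
	rq->on_timeout_list = true;
}

/* The sanity check lives in the start path, as the patch describes. */
static void start_request(struct fake_request *rq)
{
	assert(!atomic_load(&rq->complete));   /* stands in for BUG_ON */
	add_timer(rq);
}

/* Completion path: bails out while the timeout handler owns the request. */
static bool complete_request(struct fake_request *rq)
{
	if (mark_rq_complete(rq))
		return false;
	rq->on_timeout_list = false;   /* take rq off the timeout list */
	return true;
}

/* "Reset timer" case in the patched order: re-arm the timer first, and
 * only then clear the bit that lets a completion through. */
static void timed_out_reset_timer(struct fake_request *rq)
{
	add_timer(rq);
	clear_rq_complete(rq);
}
```

With this order, a completion that arrives after the bit is cleared always finds the request on the timeout list and removes it, which is the case the message above calls harmless.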

Signed-off-by: Jeff Moyer
Acked-by: Hannes Reinecke
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

Jeff Moyer
2013-11-08 23:59:04 +0800

31 Oct, 2013

1 commit

9f7e45d83 ARM: 7794/1: block: Rename parameter dma_mask to max_addr for blk_queue_bounce_limit()

The blk_queue_bounce_limit() API parameter 'dma_mask' is actually the
maximum address the device can handle rather than a dma_mask. Rename
it accordingly to avoid it being interpreted as dma_mask.

No functional change.

The idea is to fix the bad assumptions about dma_mask wherever it could
be misinterpreted.

Cc: Jens Axboe
Signed-off-by: Santosh Shilimkar
Signed-off-by: Russell King

Santosh Shilimkar
2013-10-31 22:49:22 +0800

30 Oct, 2013

2 commits

e7e245000 blk-mq: don't disallow request merges for req->special being set

For blk-mq, if a driver has requested per-request payload data
to carry command structures, they are stuffed into req->special.
For an old style request based driver, req->special is used
for the same purpose but indicates that a per-driver request
structure has been prepared for the request already. So for the
old style driver, we do not merge such requests.

As most/all blk-mq drivers will use the payload feature, and
since we have no problem merging on these, make this check
dependent on whether it's a blk-mq enabled driver or not.
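
A minimal sketch of the check described above, using hypothetical stand-in types. The names fake_queue, fake_request, is_mq, req_blocks_merge and may_merge are illustrative assumptions, not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the queue and request structures. */
struct fake_queue {
	bool is_mq;             /* true for a blk-mq enabled driver */
};

struct fake_request {
	struct fake_queue *q;
	void *special;          /* payload (blk-mq) or prepared driver request */
};

/* Only an old style driver treats a set ->special as "do not merge". */
static bool req_blocks_merge(const struct fake_request *req)
{
	return !req->q->is_mq && req->special != NULL;
}

/* Two requests may be merged only if neither side blocks it. */
static bool may_merge(const struct fake_request *a,
		      const struct fake_request *b)
{
	return !req_blocks_merge(a) && !req_blocks_merge(b);
}
```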

Reported-by: Shaohua Li
Signed-off-by: Jens Axboe

Jens Axboe
2013-10-30 02:11:47 +0800
92f399c72 blk-mq: mq plug list breakage

We switched to plug mq_list for mq, but some code is still using the old list.

Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2013-10-30 02:01:03 +0800

29 Oct, 2013

1 commit

3228f48be blk-mq: fix for flush deadlock

The flush state machine takes in a struct request, which then is
submitted multiple times to the underlying driver. The old block code
reuses the same request for each of those, so it does not have an
issue with tapping into the request pool. The new one, on the other hand,
allocates a new request for each of the actual steps of the flush
sequence. If we have already allocated all of the tags for IO, we will
fail allocating the flush request.

Set aside a reserved request just for flushes.
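
The idea of a reserved request can be sketched with a toy tag allocator. This is an illustration under stated assumptions only: NR_TAGS, NR_RESERVED, struct tag_pool, alloc_tag and free_tag are hypothetical names, and the real blk-mq tag code is organised differently.

```c
#include <stdbool.h>

#define NR_TAGS      4   /* hypothetical queue depth */
#define NR_RESERVED  1   /* tags set aside for flushes */

struct tag_pool {
	bool in_use[NR_TAGS];
};

/* Tags [0, NR_RESERVED) are handed out only when 'reserved' is true;
 * normal IO allocates from [NR_RESERVED, NR_TAGS).
 * Returns the tag, or -1 if the relevant range is exhausted. */
static int alloc_tag(struct tag_pool *pool, bool reserved)
{
	int start = reserved ? 0 : NR_RESERVED;
	int end = reserved ? NR_RESERVED : NR_TAGS;

	for (int i = start; i < end; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = true;
			return i;
		}
	}
	return -1;
}

static void free_tag(struct tag_pool *pool, int tag)
{
	pool->in_use[tag] = false;
}
```

Even when normal IO has consumed every ordinary tag, a flush can still obtain the reserved one, so the flush sequence is not starved by the IO it is waiting on.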

Signed-off-by: Jens Axboe

Christoph Hellwig
2013-10-29 03:33:58 +0800

25 Oct, 2013

4 commits

280d45f6c blk-mq: add blk_mq_stop_hw_queues

Add a helper to iterate over all hw queues and stop them. This is useful
for drivers that implement PM suspend functionality.

Signed-off-by: Christoph Hellwig

Modified to just call blk_mq_stop_hw_queue() by Jens.

Signed-off-by: Jens Axboe

Christoph Hellwig
2013-10-25 21:45:58 +0800
320ae51fe blk-mq: new multi-queue block IO queueing mechanism

Linux currently has two models for block devices:

- The classic request_fn based approach, where drivers use struct
request units for IO. The block layer provides various helper
functionalities to let drivers share code, things like tag
management, timeout handling, queueing, etc.

- The "stacked" approach, where a driver squeezes in between the
block layer and IO submitter. Since this bypasses the IO stack,
drivers generally have to manage everything themselves.

With drivers being written for new high IOPS devices, the classic
request_fn based driver doesn't work well enough. The design dates
back to when both SMP and high IOPS were rare. It has problems with
scaling to bigger machines, and runs into scaling issues even on
smaller machines when you have IOPS in the hundreds of thousands
per device.

The stacked approach is then most often selected as the model
for the driver. But this means that everybody has to re-invent
everything, and along with that we get all the problems again
that the shared approach solved.

This commit introduces blk-mq, block multi queue support. The
design is centered around per-cpu queues for queueing IO, which
then funnel down into x number of hardware submission queues.
We might have a 1:1 mapping between the two, or it might be
an N:M mapping. That all depends on what the hardware supports.
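
As a rough illustration of the 1:1 versus N:M cases, the sketch below spreads per-cpu queues evenly over the hardware queues. The function name and the even-spread policy are assumptions made for the example; the commit message does not say how blk-mq actually builds its map.

```c
/* Hypothetical mapping from a CPU's software queue to a hardware
 * submission queue. Assumes cpu < nr_cpus and nr_cpus > 0.
 * With nr_hw_queues == nr_cpus this is the 1:1 case; with fewer
 * hardware queues, neighbouring CPUs share one. */
static unsigned int cpu_to_hw_queue(unsigned int cpu,
				    unsigned int nr_cpus,
				    unsigned int nr_hw_queues)
{
	return cpu * nr_hw_queues / nr_cpus;
}
```

For example, with 8 CPUs and 2 hardware queues, CPUs 0 to 3 funnel into queue 0 and CPUs 4 to 7 into queue 1.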

blk-mq provides various helper functions, which include:

- Scalable support for request tagging. Most devices need to
be able to uniquely identify a request both in the driver and
to the hardware. The tagging uses per-cpu caches for freed
tags, to enable cache hot reuse.

- Timeout handling without tracking request on a per-device
basis. Basically the driver should be able to get a notification,
if a request happens to fail.

- Optional support for non 1:1 mappings between issue and
submission queues. blk-mq can redirect IO completions to the
desired location.

- Support for per-request payloads. Drivers almost always need
to associate a request structure with some driver private
command structure. Drivers can tell blk-mq this at init time,
and then any request handed to the driver will have the
required size of memory associated with it.

- Support for merging of IO, and plugging. The stacked model
gets neither of these. Even for high IOPS devices, merging
sequential IO reduces per-command overhead and thus
increases bandwidth.
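
The per-request payload item can be sketched as a single allocation that places the driver's command structure directly behind the request. The names fake_request, alloc_request and request_to_payload are hypothetical, and the layout is an assumption for illustration rather than a description of the blk-mq internals.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct request. The alignment keeps the
 * payload that follows it suitably aligned for any driver structure. */
struct fake_request {
	_Alignas(max_align_t) int tag;
};

/* Allocate one zeroed request with cmd_size bytes of driver payload
 * placed directly after it. The size is fixed once, "at init time". */
static struct fake_request *alloc_request(size_t cmd_size)
{
	return calloc(1, sizeof(struct fake_request) + cmd_size);
}

/* The payload starts right behind the request structure. */
static void *request_to_payload(struct fake_request *rq)
{
	return rq + 1;
}
```

A driver would pass the size of its private command structure once and then convert any request it is handed into that structure without a second allocation.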

For now, this is provided as a potential 3rd queueing model, with
the hope being that, as it matures, it can replace both the classic
and stacked model. That would get us back to having just 1 real
model for block devices, leaving the stacked approach to dm/md
devices (as it was originally intended).

Contributions in this patch from the following people:

Shaohua Li
Alexander Gordeev
Christoph Hellwig
Mike Christie
Matias Bjorling
Jeff Moyer

Acked-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2013-10-25 18:56:00 +0800
71fe07d04 block: remove request ref_count

This reference count has been around since before git history, but the only
place where it's used is in blk_execute_rq, and there it is entirely useless
as it is incremented before submitting the request and decremented in the
end_io handler before waking up the submitter thread.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2013-10-25 18:55:59 +0800
5953316db block: make rq->cmd_flags be 64-bit

We have officially run out of flags in a 32-bit space. Extend it
to 64-bit even on 32-bit archs.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2013-10-25 18:55:59 +0800

17 Oct, 2013

1 commit

87fc0ad2a block/partitions/efi.c: treat size mismatch as a warning, not an error

In commit 27a7c642174e ("partitions/efi: account for pmbr size in lba")
we started treating bad sizes in lba field of the partition that has the
0xEE (GPT protective) as errors.

However, we may run into these "bad sizes" in the real world if someone
uses dd to copy an image from a smaller disk to a bigger disk. Since
this case used to work (even without using force_gpt), keep it working
and treat the size mismatch as a warning instead of an error.

Reported-by: Josh Triplett
Reported-by: Sean Paul
Signed-off-by: Doug Anderson
Reviewed-by: Josh Triplett
Acked-by: Davidlohr Bueso
Tested-by: Artem Bityutskiy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Doug Anderson
2013-10-17 12:35:53 +0800

01 Oct, 2013

1 commit

080506ad0 block: change config option name for cmdline partition parsing

Recently commit bab55417b10c ("block: support embedded device command
line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
is too generic and sounds like it enables/disables generic kernel boot
arg processing, when it really is block specific.

Before this option becomes a part of a full/final release, add the BLK_
prefix to it so that it is clear in absence of any other context that it
is block specific.

In addition, fix up the following less critical items:
- help text was not really at all helpful.
- index file for Documentation was not updated
- add the new arg to Documentation/kernel-parameters.txt
- clarify wording in source comments

Signed-off-by: Paul Gortmaker
Cc: Jens Axboe
Cc: Cai Zhiyong
Cc: Wei Yongjun
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Paul Gortmaker
2013-10-01 05:31:02 +0800

23 Sep, 2013

2 commits

68cf8d0c7 Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-block

Pull block IO fixes from Jens Axboe:
"After merge window, no new stuff this time only a collection of neatly
confined and simple fixes"

* 'for-3.12/core' of git://git.kernel.dk/linux-block:
cfq: explicitly use 64bit divide operation for 64bit arguments
block: Add nr_bios to block_rq_remap tracepoint
If the queue is dying then we only call the rq->end_io callout. This leaves bios setup on the request, because the caller assumes when the blk_execute_rq_nowait/blk_execute_rq call has completed that the rq->bios have been cleaned up.
bio-integrity: Fix use of bs->bio_integrity_pool after free
blkcg: relocate root_blkg setting and clearing
block: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)
block: trace all devices plug operation

Linus Torvalds
2013-09-23 06:00:11 +0800
f3cff25f0 cfq: explicitly use 64bit divide operation for 64bit arguments

'samples' is a 64-bit operand, but the second parameter of do_div() is
32-bit. do_div() silently truncates the high 32 bits, so the calculated
result is invalid.

If the low 32 bits of 'samples' are all zero, the truncated divisor is 0
and do_div() crashes the kernel.
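
The truncation can be shown in plain userspace C. This is only a model of the arithmetic: truncated_divisor and div_u64_by_u64 are hypothetical helpers written for the example, not kernel functions.

```c
#include <stdint.h>

/* Models what happens to a 64-bit value passed where a 32-bit divisor
 * is expected: the high 32 bits are silently dropped. */
static uint32_t truncated_divisor(uint64_t divisor)
{
	return (uint32_t)divisor;
}

/* A full 64-by-64 division, which keeps every bit of the divisor.
 * The caller must ensure divisor != 0. */
static uint64_t div_u64_by_u64(uint64_t dividend, uint64_t divisor)
{
	return dividend / divisor;
}
```

A divisor of 0x100000000 truncates to 0, which is the division-by-zero case, while 0x100000005 truncates to 5 and quietly yields a wrong quotient.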

Signed-off-by: Anatol Pomozov
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Anatol Pomozov
2013-09-23 02:43:47 +0800