16 Dec, 2011

1 commit

  • While probing, fd sets up the queue, probes the hardware and tears down
    the queue if probing fails. In the process, blk_drain_queue() kicks the
    queue that failed to finish initialization, and fd is unhappy about
    that.

    floppy0: no floppy controllers found
    ------------[ cut here ]------------
    WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
    Hardware name: To Be Filled By O.E.M.
    VFS: do_fd_request called on non-open device
    Modules linked in:
    Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
    Call Trace:
    [] warn_slowpath_common+0x7a/0xb0
    [] warn_slowpath_fmt+0x41/0x50
    [] do_fd_request+0xbf/0xd0
    [] blk_drain_queue+0x65/0x80
    [] blk_cleanup_queue+0xe3/0x1a0
    [] floppy_init+0xdeb/0xe28
    [] ? daring+0x6b/0x6b
    [] do_one_initcall+0x3f/0x170
    [] kernel_init+0x9d/0x11e
    [] ? schedule_tail+0x22/0xa0
    [] kernel_thread_helper+0x4/0x10
    [] ? start_kernel+0x2be/0x2be
    [] ? gs_change+0xb/0xb

    Avoid it by making blk_drain_queue() kick the queue only if the dispatch
    queue has something on it.
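
    A minimal sketch of the resulting check (not necessarily the exact hunk;
    it only illustrates the idea):

    /* inside blk_drain_queue(), with queue_lock held */
    if (!list_empty(&q->queue_head))
            __blk_run_queue(q);     /* kick only when there is work to dispatch */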

    Signed-off-by: Tejun Heo
    Reported-by: Ralf Hildebrandt
    Reported-by: Wu Fengguang
    Tested-by: Sergei Trofimovich
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Nov, 2011

1 commit

  • struct request_queue is allocated with __GFP_ZERO so its "node" field is
    zero before initialization. This causes an oops if node 0 is offline in
    the page allocator because its zonelists are not initialized. From Dave
    Young's dmesg:

    SRAT: Node 1 PXM 2 0-d0000000
    SRAT: Node 1 PXM 2 100000000-330000000
    SRAT: Node 0 PXM 1 330000000-630000000
    Initmem setup node 1 0000000000000000-000000000affb000
    ...
    Built 1 zonelists in Node order, mobility grouping on.
    ...
    BUG: unable to handle kernel paging request at 0000000000001c08
    IP: [] __alloc_pages_nodemask+0xb5/0x870

    and __alloc_pages_nodemask+0xb5 translates to a NULL pointer
    dereference of zonelist->_zonerefs.

    The fix is to initialize q->node at the time of allocation so the correct
    node is passed to the slab allocator later.

    Since blk_init_allocated_queue_node() is no longer needed, merge it with
    blk_init_allocated_queue().
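
    A minimal sketch of the fix, assuming the usual shape of
    blk_alloc_queue_node():

    struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
    {
            struct request_queue *q;

            q = kmem_cache_alloc_node(blk_requestq_cachep,
                                      gfp_mask | __GFP_ZERO, node_id);
            if (!q)
                    return NULL;

            q->node = node_id;      /* record the NUMA node before anything
                                       else allocates against q->node */
            ...
    }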

    [rientjes@google.com: changelog, initializing q->node]
    Cc: stable@vger.kernel.org [2.6.37+]
    Reported-by: Dave Young
    Signed-off-by: Mike Snitzer
    Signed-off-by: David Rientjes
    Tested-by: Dave Young
    Signed-off-by: Jens Axboe

    Mike Snitzer
     


04 Nov, 2011

1 commit

  • blk_cleanup_queue() may be called before the elevator is set up on a
    queue, which triggers the following oops.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] elv_drain_elevator+0x1c/0x70
    ...
    Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
    Bochs Bochs
    RIP: 0010:[] [] elv_drain_elevator+0x1c/0x70
    ...
    Call Trace:
    [] blk_drain_queue+0x42/0x70
    [] blk_cleanup_queue+0xd0/0x1c0
    [] md_free+0x50/0x70
    [] kobject_release+0x8b/0x1d0
    [] kref_put+0x36/0xa0
    [] kobject_put+0x27/0x60
    [] mddev_delayed_delete+0x2f/0x40
    [] process_one_work+0x100/0x3b0
    [] worker_thread+0x15f/0x3a0
    [] kthread+0x87/0x90
    [] kernel_thread_helper+0x4/0x10

    Fix it by making blk_cleanup_queue() check whether q->elevator is set
    up before invoking blk_drain_queue().
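
    Roughly, the guard looks like this (a sketch, not the exact patch):

    /* in blk_cleanup_queue(): the elevator may not be set up yet */
    if (q->elevator)
            blk_drain_queue(q, true);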

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Jiri Slaby
    Signed-off-by: Jens Axboe

    Tejun Heo
     

24 Oct, 2011

3 commits

  • Jens Axboe
     
  • A dm-multipath user reported[1] a problem when trying to boot
    a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
    (block: fix flush machinery for stacking drivers with differring
    flush flags) applied. It turns out that an empty flush request
    can be sent into blk_insert_flush. When the BUG_ON was fixed
    to allow for this, I/O on the underlying device would stall. The
    reason is that blk_insert_cloned_request does not kick the queue.
    In the aforementioned commit, I had added a special case to
    kick the queue if data was sent down but the queue flags did
    not require a flush. A better solution is to push the queue
    kick up into blk_insert_cloned_request.

    This patch, along with a follow-on which fixes the BUG_ON, fixes
    the issue reported.

    [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html
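
    A sketch of where the kick ends up, assuming the usual locking in
    blk_insert_cloned_request():

    spin_lock_irqsave(q->queue_lock, flags);
    ...
    add_acct_request(q, rq, ELEVATOR_INSERT_BACK);
    __blk_run_queue(q);             /* kick the queue so cloned requests,
                                       including empty flushes, make progress */
    spin_unlock_irqrestore(q->queue_lock, flags);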

    Reported-by: Christophe Saout
    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo

    Stable note: 3.1
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • bio originally had the functionality to set the completion CPU, but
    it is broken.

    Christoph said that "This code is unused, and from all the
    discussions lately pretty obviously broken. The only thing keeping
    it serves is creating more confusion and possibly more bugs."

    And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
    with leaving cpu control to the request based drivers, they are the
    only ones that can toggle the setting anyway".

    So this patch removes all the work of controlling the completion CPU
    from a bio.

    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
     

19 Oct, 2011

7 commits

  • request_queue is refcounted but actually depends on lifetime
    management from the queue owner - on blk_cleanup_queue(), the block
    layer expects that there's no request passing through the
    request_queue and no new ones will be issued.

    This is fundamentally broken. The queue owner (e.g. the SCSI layer)
    doesn't have a way to know whether there are other active users before
    calling blk_cleanup_queue(), and other users (e.g. bsg) have no
    guarantee that the queue is and will stay valid while they hold a
    reference.

    With a delay added in blk_queue_bio() before queue_lock is grabbed, the
    following oops can easily be triggered when a device is removed with
    in-flight IOs.

    sd 0:0:1:0: [sdb] Stopping disk
    ata1.01: disabled
    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
    RIP: 0010:[] [] elv_rqhash_find+0x61/0x100
    ...
    Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
    ...
    Call Trace:
    [] elv_merge+0x84/0xe0
    [] blk_queue_bio+0xf4/0x400
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x74/0x100
    [] dio_bio_submit+0xbc/0xc0
    [] __blockdev_direct_IO+0x92e/0xb40
    [] blkdev_direct_IO+0x57/0x60
    [] generic_file_aio_read+0x6d5/0x760
    [] do_sync_read+0xda/0x120
    [] vfs_read+0xc5/0x180
    [] sys_pread64+0x9a/0xb0
    [] system_call_fastpath+0x16/0x1b

    This happens because blk_cleanup_queue() destroys the queue and
    elevator whether IOs are in progress or not, and DEAD tests are
    sprinkled in the request processing path without proper
    synchronization.

    A similar problem exists for blk-throtl. On queue cleanup, blk-throtl
    is shut down whether it has requests in it or not. Depending on
    timing, it either oopses or throttled bios are lost, putting tasks
    which are waiting for bio completion into an eternal D state.

    The way it should work is to have the usual clear distinction between
    shutdown and release. Shutdown drains all currently pending requests,
    marks the queue dead, and performs partial teardown of the now
    unnecessary parts of the queue. Even after shutdown is complete,
    reference holders are still allowed to issue requests to the queue,
    although they will be immediately failed. The rest of the teardown
    happens on release.

    This patch makes the following changes to make blk_cleanup_queue()
    behave as a proper shutdown.

    * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
    queue_lock.

    * Unsynchronized DEAD check in generic_make_request_checks() removed.
    This couldn't make any meaningful difference as the queue could die
    after the check.

    * blk_drain_queue() updated such that it can drain all requests and is
    now called during cleanup.

    * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
    drains all throttled bios during cleanup and frees td when the queue
    is released.
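
    The shutdown half can be outlined roughly as follows (a sketch using the
    names from the text above; the exact calls and arguments may differ):

    /* blk_cleanup_queue() acting as a proper shutdown */
    mutex_lock(&q->exit_mutex);
    spin_lock_irq(q->queue_lock);
    queue_flag_set(QUEUE_FLAG_DEAD, q);     /* no new requests accepted */
    spin_unlock_irq(q->queue_lock);
    mutex_unlock(&q->exit_mutex);

    blk_drain_queue(q, true);               /* drain everything, including
                                               throttled bios */
    /* partial teardown here; the rest happens on release */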

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • attempt_plug_merge() accesses the elevator without holding queue_lock
    and may call into ->elevator_bio_merge_fn(). The elevator is guaranteed
    to be valid because it's accessed only if the plugged list has requests
    and the elevator is never exited with live requests, so as long as the
    elevator method can deal with unlocked access, this is safe.

    Explain the sync rules around attempt_plug_merge() and drop the
    unnecessary @tsk parameter.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently get_request[_wait]() allocates a request whether the queue
    is dead or not. This patch makes get_request[_wait]() return NULL if
    @q is dead. blk_queue_bio() is updated to fail the submitted bio if
    request allocation fails. While at it, add docbook comments for
    get_request[_wait]().

    Note that the current code has a rather unclear assumption (there are
    spurious DEAD tests scattered around) that the owner of a queue
    guarantees that no request travels through the block layer if the
    queue is dead. This patch in itself doesn't change much; however, it
    will allow fixing the broken assumption in the next patch.
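
    A sketch of the new failure mode (the exact check may differ):

    static struct request *get_request(struct request_queue *q, int rw_flags,
                                       struct bio *bio, gfp_t gfp_mask)
    {
            if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
                    return NULL;    /* queue is being torn down */
            ...
    }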

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_throtl_bio() and throtl_get_tg() have a rather unusual interface.

    * throtl_get_tg() returns a pointer to a valid tg or ERR_PTR(-ENODEV),
    and drops queue_lock in the latter case. A different locking context
    depending on the return value is error-prone, and the DEAD state is
    scheduled to be protected by queue_lock anyway. Move the DEAD check
    inside queue_lock and return a valid tg or NULL.

    * blk_throtl_bio() indicates its status both with its return value
    and through the in/out param **@bio. The former indicates whether the
    queue was found to be dead during throtl processing; the latter
    whether the bio was throttled.

    There's no point in returning the DEAD check result from
    blk_throtl_bio(). The queue can die after blk_throtl_bio() has
    finished but before make_request_fn() grabs the queue lock.

    Make it take *@bio instead and return a boolean result indicating
    whether the bio is throttled or not.
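
    In other words, the interface becomes something like this (a sketch):

    bool blk_throtl_bio(struct request_queue *q, struct bio *bio);

    /* caller side, in the make_request path */
    if (blk_throtl_bio(q, bio))
            return;         /* bio was throttled and queued by blk-throttle */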

    This patch doesn't cause any visible functional difference.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Reorganize queue draining related code in preparation of queue exit
    changes.

    * Factor out actual draining from elv_quiesce_start() to
    blk_drain_queue().

    * Make elv_quiesce_start/end() responsible for their own locking.

    * Replace open-coded ELVSWITCH clearing in elevator_switch() with
    elv_quiesce_end().

    This patch doesn't cause any visible functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_alloc_request() and freed_request() take different combinations of
    REQ_* @flags, @priv and @is_sync even though @flags is a superset of
    the latter two. Make them take @flags only. This cleans up the code a
    bit and will ease updating allocation-related REQ_* flags.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    block/blk-core.c
    include/linux/blkdev.h

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Sep, 2011

1 commit

  • A kernel crash is observed when a mounted ext3/ext4 filesystem is
    physically removed. The problem is that blk_cleanup_queue() frees up
    resources (e.g. by calling elevator_exit()) that are not checked for
    in normal operation. So we should rather move these calls to the
    destructor function blk_release_queue(), as at that point all
    remaining references are gone. However, in doing so we have to ensure
    that any externally supplied queue_lock is disconnected, as the driver
    might free the lock after the call to blk_cleanup_queue().
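
    The lock handover can be sketched like this (assuming the queue keeps an
    internal __queue_lock to fall back on):

    spinlock_t *lock = q->queue_lock;

    spin_lock_irq(lock);
    if (q->queue_lock != &q->__queue_lock)
            q->queue_lock = &q->__queue_lock;       /* stop using the
                                                       driver-supplied lock */
    spin_unlock_irq(lock);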

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

21 Sep, 2011

1 commit

  • Thus spake Andrew Morton:

    "And I have the usual maintainability whine. If someone comes up to
    vmscan.c and sees it calling blk_start_plug(), how are they supposed to
    work out why that call is there? They go look at the blk_start_plug()
    definition and it is undocumented. I think we can do better than this?"

    Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens
    Axboe and from an earlier attempt by Shaohua Li to document blk-plug.
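
    For reference, the basic usage pattern being documented looks like this
    (submit_batch_of_bios() is a hypothetical stand-in for real submissions):

    struct blk_plug plug;

    blk_start_plug(&plug);
    /* I/O submitted here is held on the per-task plug list instead of
     * being sent to the device queues one request at a time */
    submit_batch_of_bios();
    blk_finish_plug(&plug);         /* flush the plugged requests to the queues */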

    [akpm@linux-foundation.org: grammatical and spelling tweaks]
    Signed-off-by: Suresh Jayaraman
    Cc: Shaohua Li
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Suresh Jayaraman
     

15 Sep, 2011

1 commit

  • Move all the checks performed on a bio into a new helper, and call it
    as soon as the bio is submitted, even if it is a re-submission from
    ->make_request.

    We explicitly mark the new helper as being non-inlined, as the stack
    usage for printing the block device name in the failure case is quite
    high and this is a patch where we have to be extremely conservative
    about stack usage.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     


24 Aug, 2011

2 commits

  • Clean up the code a little bit. attempt_plug_merge() traverses the
    plug list anyway, so we can do the request counting there, which
    reduces stack size a little bit.
    The motivation here is that I suspect we should count the requests for
    each queue (a task could handle multiple disks in the meantime), but
    my tests don't show it's worth doing. If somebody proves we should do
    it, the change below will make that easier.

    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Do blk_flush_plug_list() first and then add the new request at the
    tail. The new request can't be merged into existing requests, but
    later new requests might be merged with this new one. If
    blk_flush_plug_list() is done later, the merge doesn't happen.
    Believe it or not, this fixes a 10% regression running a sysbench
    workload.

    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

20 Aug, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
    Revert "cfq: Remove special treatment for metadata rqs."
    block: fix flush machinery for stacking drivers with differring flush flags
    block: improve rq_affinity placement
    blktrace: add FLUSH/FUA support
    Move some REQ flags to the common bio/request area
    allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
    xen/blkback: Make description more obvious.
    cfq-iosched: Add documentation about idling
    block: Make rq_affinity = 1 work as expected
    block: swim3: fix unterminated of_device_id table
    block/genhd.c: remove useless cast in diskstats_show()
    drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
    drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
    bsg-lib: add module.h include
    cfq-iosched: Reduce linked group count upon group destruction
    blk-throttle: correctly determine sync bio
    loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
    loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
    loop: add management interface for on-demand device allocation
    loop: replace linked list of allocated devices with an idr index
    ...

    Linus Torvalds
     

16 Aug, 2011

1 commit

  • Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
    FLUSH/FUA to support merge, introduced a performance regression when
    running any sort of fsyncing workload using dm-multipath and certain
    storage (in our case, an HP EVA). The test I ran was fs_mark, and it
    dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
    that dm-multipath always advertised flush+fua support, and passed
    commands on down the stack, where those flags used to get stripped off.
    The above commit changed that behavior:

    static inline struct request *__elv_next_request(struct request_queue *q)
    {
    struct request *rq;

    while (1) {
    - while (!list_empty(&q->queue_head)) {
    + if (!list_empty(&q->queue_head)) {
    rq = list_entry_rq(q->queue_head.next);
    - if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
    - (rq->cmd_flags & REQ_FLUSH_SEQ))
    - return rq;
    - rq = blk_do_flush(q, rq);
    - if (rq)
    - return rq;
    + return rq;
    }

    Note that previously, a command would come in here, have
    REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

    struct request *blk_do_flush(struct request_queue *q, struct request *rq)
    {
    unsigned int fflags = q->flush_flags; /* may change, cache it */
    bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
    bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
    bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
    REQ_FUA);
    unsigned skip = 0;
    ...
    if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
    rq->cmd_flags &= ~REQ_FLUSH;
    if (!has_fua)
    rq->cmd_flags &= ~REQ_FUA;
    return rq;
    }

    So, the flush machinery was bypassed in such cases (q->flush_flags == 0
    && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

    Now, however, we don't get into the flush machinery at all. Instead,
    __elv_next_request just hands a request with flush and fua bits set to
    the scsi_request_fn, even if the underlying request_queue does not
    support flush or fua.

    The agreed upon approach is to fix the flush machinery to allow
    stacking. While this isn't used in practice (since there is only one
    request-based dm target, and that target will now reflect the flush
    flags of the underlying device), it does future-proof the solution, and
    make it function as designed.

    In order to make this work, I had to add a field to the struct request,
    inside the flush structure (to store the original req->end_io). Shaohua
    had suggested overloading the union with rb_node and completion_data,
    but the completion data is used by device mapper and can also be used by
    other drivers. So, I didn't see a way around the additional field.
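
    The addition is essentially one pointer inside the existing flush
    bookkeeping (a sketch; surrounding members elided):

    struct request {
            ...
            struct {
                    unsigned int            seq;
                    struct list_head        list;
                    rq_end_io_fn            *saved_end_io;  /* original end_io,
                                                               restored after the
                                                               flush sequence */
            } flush;
            ...
    };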

    I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
    the lost performance. Comments and other testers, as always, are
    appreciated.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

04 Aug, 2011

1 commit

  • init_fault_attr_dentries() is used to export fault_attr via debugfs,
    but it can only export it in the debugfs root directory.

    Per Forlin is working on mmc_fail_request, which adds support for
    injecting data errors after a completed host transfer in the MMC
    subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    exported in a per-host debugfs directory, e.g.
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help for mmc_fail_request, so this
    introduces fault_create_debugfs_attr(), which can create the directory
    under an arbitrary parent directory, and replaces
    init_fault_attr_dentries().
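
    Usage then looks roughly like this (the mmc-side names are hypothetical):

    struct dentry *dir;

    dir = fault_create_debugfs_attr("fail_mmc_request",
                                    host_debugfs_root, &host->fail_attr);
    if (IS_ERR(dir))
            return PTR_ERR(dir);    /* debugfs unavailable or creation failed */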

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

27 Jul, 2011

1 commit

  • This changes should_fail_request() into a more usable wrapper around
    should_fail(). It avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
    the middle of a function.
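
    The resulting shape is roughly (a sketch of the wrapper and its stub):

    #ifdef CONFIG_FAIL_MAKE_REQUEST
    static DECLARE_FAULT_ATTR(fail_make_request);

    static bool should_fail_request(struct hd_struct *part, unsigned int bytes)
    {
            return part->make_it_fail && should_fail(&fail_make_request, bytes);
    }
    #else
    static inline bool should_fail_request(struct hd_struct *part,
                                           unsigned int bytes)
    {
            return false;   /* fault injection compiled out */
    }
    #endif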

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

26 Jul, 2011

2 commits

  • After commit 5757a6d7 introduced an unsafe call to
    smp_processor_id(), with preempt debugging turned on we spew a lot of:

    BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
    caller is __make_request+0x1b8/0x308
    [] (unwind_backtrace+0x0/0xe8) from [] (debug_smp_processor_id+0xbc/0xf0)
    [] (debug_smp_processor_id+0xbc/0xf0) from [] (__make_request+0x1b8/0x308)
    [] (__make_request+0x1b8/0x308) from [] (generic_make_request+0x4dc/0x558)
    [] (generic_make_request+0x4dc/0x558) from [] (submit_bio+0x114/0x138)
    [] (submit_bio+0x114/0x138) from [] (submit_bh+0x148/0x16c)
    [] (submit_bh+0x148/0x16c) from [] (__sync_dirty_buffer+0x88/0xd8)
    [] (__sync_dirty_buffer+0x88/0xd8) from [] (journal_commit_transaction+0x1198/0x1688)
    [] (journal_commit_transaction+0x1198/0x1688) from [] (kjournald+0xb4/0x224)
    [] (kjournald+0xb4/0x224) from [] (kthread+0x8c/0x94)
    [] (kthread+0x8c/0x94) from [] (kernel_thread_exit+0x0/0x8)

    Fix this by just using raw_smp_processor_id(); it's just a hint
    after all. There's no pinning of the CPU or access to per-cpu
    structures involved.
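
    The fix boils down to using the raw accessor where the value is only a
    hint (a sketch of the relevant line in __make_request()):

    if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))
            req->cpu = raw_smp_processor_id();      /* being preempted right
                                                       after this read is harmless */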

    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • Some systems benefit from completions always being steered to the
    strict requester cpu rather than the looser "per-socket" steering that
    blk_cpu_to_group() attempts by default. This is because the first
    CPU in the group mask ends up being completely overloaded with work,
    while the others (including the original submitter) have power left
    to spare.

    Allow the strict mode to be set by writing '2' to the sysfs control
    file. This is identical to the scheme used for the nomerges file,
    where '2' is a more aggressive setting than just being turned on.

    echo 2 > /sys/block/<dev>/queue/rq_affinity

    Cc: Christoph Hellwig
    Cc: Roland Dreier
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

22 Jul, 2011

1 commit

  • USB surprise removal of sr is triggering an oops in
    scsi_dispatch_command(). What seems to be happening is that USB is
    hanging on to a queue reference until the last close of the upper
    device, so the crash is caused by surprise remove of a mounted CD
    followed by attempted unmount.

    The problem is that USB doesn't issue its final commands as part of
    the SCSI teardown path, but on last close when the block queue is long
    gone. The long term fix is probably to make sr do the teardown in the
    same way as sd (so remove all the lower bits on ejection, but keep the
    upper disk alive until last close of user space). However, the
    current oops can be simply fixed by not allowing any commands to be
    sent to a dead queue.

    Cc: stable@kernel.org
    Signed-off-by: James Bottomley

    James Bottomley
     

08 Jul, 2011

1 commit

  • When I test a fio script with a big I/O depth, I find the total
    throughput drops compared to some relatively small I/O depths. The
    reason is that the thread accumulates big requests in its plug list
    and causes some delays (surely this depends on CPU speed).
    I thought we'd better have a threshold for requests. When the
    threshold is reached, it means there is no request merge and queue
    lock contention isn't severe when pushing per-task requests to the
    queue, so the main advantages of blk plug don't exist. We can force a
    plug list flush in this case.
    With this, my test throughput actually increases and is almost equal
    to the small I/O depth case. Another side effect is that irq-off time
    decreases in blk_flush_plug_list() for big I/O depths.
    The BLK_MAX_REQUEST_COUNT is chosen arbitrarily, but 16 is efficient
    at reducing lock contention for me. But I'm open here, 32 is ok in my
    test too.
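
    The threshold check amounts to something like this in the plugging path
    (a sketch; the plug->count bookkeeping is an assumption):

    #define BLK_MAX_REQUEST_COUNT   16

    if (plug->count >= BLK_MAX_REQUEST_COUNT)
            blk_flush_plug_list(plug, false);       /* push the batch down before
                                                       it grows too large */
    plug->count++;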

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     


23 May, 2011

1 commit

  • Commit 73c101011926 ("block: initial patch for on-stack per-task
    plugging") removed calls to elv_bio_merged() when @bio was merged with
    @req. Re-add them.

    This in turn will update the merged stats in the associated group.
    That should be safe as long as the request holds a reference to the
    blkio_group.

    Signed-off-by: Namhyung Kim
    Cc: Divyesh Shah
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

21 May, 2011

2 commits

  • We don't need them anymore, so kill:

    - REQ_ON_PLUG checks in various places
    - !rq_mergeable() check in plug merging

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently, all the cfq_group or throtl_group allocations happen while
    we are holding ->queue_lock, where sleeping is not allowed.

    Soon, we will move to per-cpu stats and will also need to allocate the
    per-group stats. Since alloc_percpu() can sleep and thus cannot be
    called from atomic context, we need to drop ->queue_lock, allocate the
    group, retake the lock and continue processing.

    In the throttling code, I check the queue DEAD flag again to make sure
    that the driver did not call blk_cleanup_queue() in the meantime.
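
    The pattern described above looks roughly like this (tg and the
    allocation call are illustrative, not the exact code):

    spin_unlock_irq(q->queue_lock);

    /* sleeping allocations are fine here */
    tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);

    spin_lock_irq(q->queue_lock);

    /* blk_cleanup_queue() may have run while the lock was dropped */
    if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
            kfree(tg);
            return NULL;
    }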

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

18 May, 2011

1 commit

  • Consider this scenario:
    1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
    2. blk_run_queue_async();
    The second call becomes a no-op, because q->delay_work already has
    WORK_STRUCT_PENDING_BIT set, so the delayed work will still only run
    after SCSI_QUEUE_DELAY. But blk_run_queue_async() actually wants the
    delayed work to run immediately.

    Fix this by doing a cancel on potentially pending delayed work
    before queuing an immediate run of the workqueue.
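
    A sketch of the resulting helper (the workqueue calls shown are the
    generic ones and may differ from the exact patch):

    void blk_run_queue_async(struct request_queue *q)
    {
            if (likely(!blk_queue_stopped(q))) {
                    /* drop a pending delayed run so the 0-delay run below
                     * is not silently ignored */
                    cancel_delayed_work(&q->delay_work);
                    queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
            }
    }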

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

19 Apr, 2011

2 commits

  • We don't pass in a 'force_kblockd' anymore, so get rid of the
    stale comment.

    Reported-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We are currently using this flag to check whether it's safe
    to call into ->request_fn(). If it is set, we punt to kblockd.
    But we get a lot of false positives and excessive punts to
    kblockd, which hurts performance.

    The only real abuser of this infrastructure is SCSI. So export
    the async queue run and convert SCSI over to use that. There's
    room for improvement in that SCSI need not always use the async
    call, but this fixes our performance issue and they can fix that
    up in due time.

    Signed-off-by: Jens Axboe

    Jens Axboe