01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
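
    As a quick illustration of the _io() completion variants mentioned in
    the pull message above, here is a minimal sketch of the usage pattern;
    the demo_* structure and the submission step are hypothetical, only
    the completion API itself is real.

    #include <linux/completion.h>
    #include <linux/errno.h>
    #include <linux/jiffies.h>

    /* Illustrative command structure; not from any real driver. */
    struct demo_cmd {
            struct completion *done;
            /* ... device-specific fields ... */
    };

    /* Called from the device's IRQ / end_io path. */
    static void demo_cmd_done(struct demo_cmd *cmd)
    {
            complete(cmd->done);
    }

    /* Submit and wait; the sleep is charged to iowait via io_schedule(). */
    static int demo_submit_and_wait(struct demo_cmd *cmd)
    {
            DECLARE_COMPLETION_ONSTACK(done);

            cmd->done = &done;
            /* ... hand cmd to the hardware here (hypothetical step) ... */

            if (!wait_for_completion_io_timeout(&done, msecs_to_jiffies(30000)))
                    return -ETIMEDOUT;
            return 0;
    }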
     

22 Feb, 2013

1 commit

  • This provides a band-aid for stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize the page snapshotting in one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

14 Jan, 2013

2 commits

  • bio_{front|back}_merge tracepoints report a bio merging into an
    existing request but don't specify which request the bio is being
    merged into. Add @req to them. This makes it impossible to share the
    event template with block_bio_queue, so split it out.

    @req isn't used or exported to userland at this point and there is no
    userland visible behavior change. Later changes will make use of the
    extra parameter.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio completion didn't kick block_bio_complete TP. Only dm was
    explicitly triggering the TP on IO completion. This makes
    block_bio_complete TP useless for tracers which want to know about
    bios, and all other bio based drivers skip generating blktrace
    completion events.

    This patch makes all bio completions via bio_endio() generate
    block_bio_complete TP.

    * Explicit trace_block_bio_complete() invocation removed from dm and
    the trace point is unexported.

    * @rq dropped from trace_block_bio_complete(). bios may fly around
    without an associated queue. Verifying and accessing the associated
    queue belongs to the TP probes.

    * blktrace now gets both request and bio completions. Make it ignore
    bio completions if request completion path is happening.

    This makes all bio based drivers generate blktrace completion events
    properly and makes the block_bio_complete TP actually useful.

    v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev. Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.

    Signed-off-by: Tejun Heo
    Original-patch-by: Namhyung Kim
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Cc: Alasdair Kergon
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
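
    For context, here is a minimal sketch of a bio-based driver's
    make_request function (3.8-era void-returning prototype); after this
    change the bio_endio() call emits block_bio_complete by itself, so the
    explicit trace call that dm used to make is no longer needed. The
    demo_ name is illustrative.

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static void demo_make_request(struct request_queue *q, struct bio *bio)
    {
            int err = 0;

            /* ... service the bio, set err on failure ... */

            /* Now also fires the block_bio_complete tracepoint. */
            bio_endio(bio, err);
    }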
     

11 Jan, 2013

1 commit

  • Commit 975927b942c932 added blk_rq_pos to the request sort performed
    when flushing. That was aimed at the situation where a blk_plug handles
    multiple devices at the same time, such as an md device, but I think
    the same kind of situation can occur with only a single device.
    So remove the should_sort check. Since the should_sort parameter exists
    only for this purpose, it can also be deleted from blk_plug.

    CC: Shaohua Li
    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
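
    With should_sort gone, the plug list is simply sorted on every flush.
    The comparator handed to list_sort() orders requests by queue first
    and by starting sector second, roughly along these lines (a sketch
    reconstructed from the description, not the verbatim blk-core code):

    #include <linux/blkdev.h>
    #include <linux/kernel.h>
    #include <linux/list_sort.h>

    static int demo_plug_rq_cmp(void *priv, struct list_head *a,
                                struct list_head *b)
    {
            struct request *rqa = container_of(a, struct request, queuelist);
            struct request *rqb = container_of(b, struct request, queuelist);

            /* Group by queue, then order by starting sector within a queue. */
            return !(rqa->q < rqb->q ||
                     (rqa->q == rqb->q && blk_rq_pos(rqa) < blk_rq_pos(rqb)));
    }

    /* Used as: list_sort(NULL, &plug->list, demo_plug_rq_cmp); */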
     

06 Dec, 2012

5 commits

  • Some request_fn implementations, e.g. scsi_request_fn(), unlock
    the queue lock internally. This may result in multiple threads
    executing request_fn for the same queue simultaneously. Keep
    track of the number of active request_fn calls and make sure that
    blk_cleanup_queue() waits until all active request_fn invocations
    have finished. A block driver may start cleaning up resources
    needed by its request_fn as soon as blk_cleanup_queue() finished,
    so blk_cleanup_queue() must wait for all outstanding request_fn
    invocations to finish.

    Signed-off-by: Bart Van Assche
    Reported-by: Chanho Min
    Cc: James Bottomley
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
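
    The counting pattern described above looks roughly like the sketch
    below; request_fn_active is the counter added by this series, the
    surrounding helper is simplified and the drain-side wait is shown
    only as a comment.

    #include <linux/blkdev.h>

    /*
     * Queue lock is held on entry; request_fn may drop and reacquire it,
     * which is exactly why a simple lock-protected section isn't enough.
     */
    static void demo_run_queue_uncond(struct request_queue *q)
    {
            q->request_fn_active++;
            q->request_fn(q);
            q->request_fn_active--;
    }

    /*
     * blk_cleanup_queue()'s drain loop then also waits for
     * q->request_fn_active to drop back to zero before it returns.
     */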
     
  • Running a queue must continue after it has been marked dying until
    it has been marked dead. So the function blk_run_queue_async() must
    not schedule delayed work after blk_cleanup_queue() has marked a queue
    dead. Hence add a test for that queue state in blk_run_queue_async()
    and make sure that queue_unplugged() invokes that function with the
    queue lock held. This prevents the queue state from changing after
    it has been tested and before mod_delayed_work() is invoked. Drop
    the queue dying test in queue_unplugged() since it is now
    superfluous: __blk_run_queue() already tests whether or not the
    queue is dead.

    Signed-off-by: Bart Van Assche
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() has finished, so request_fn
    must not be invoked after draining has finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Let the caller of blk_drain_queue() obtain the queue lock to improve
    readability of the patch called "Avoid that request_fn is invoked on
    a dead queue".

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

26 Oct, 2012

1 commit

  • My workload is a raid5 array with 16 disks, written through our
    filesystem in direct-I/O mode.

    Using blktrace I found these messages:
    8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
    8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
    8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
    8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
    8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
    8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
    8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
    8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
    8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
    8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
    8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
    8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
    8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
    8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
    8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
    8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
    8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
    8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
    8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
    8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
    8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
    8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
    8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
    8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
    8,16 0 0 2.453853661 0 m N cfq2579 insert_request
    8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453854439 0 m N cfq2579 insert_request
    8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
    8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
    8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
    8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
    8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
    8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
    8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
    8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
    8,16 0 0 2.454795160 0 m N cfq schedule dispatch

    From the messages above we can see that rq[W 7493144 + 104] and
    rq[W 7493120 + 24] do not merge, because the bio order is:
    8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
    8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
    8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
    8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
    bio(7493144) comes first and bio(7493120) later, so the subsequent
    bios are divided into two parts. When the plug list is flushed,
    elv_attempt_insert_merge() only supports back merging, not front
    merging, so rq[7493120 + 24] can't merge with rq[7493144 + 104].

    In my testing this situation accounts for about 25% of requests on
    our system. With this patch it no longer occurs.

    Signed-off-by: Jianpeng Ma
    CC: Shaohua Li
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
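
    The percpu rw-semaphore mentioned above (used for the O_DIRECT vs.
    block-size-change fix) follows the usual asymmetric pattern sketched
    here; the demo_ names are illustrative, only the percpu_*_read/write
    calls are the real interface.

    #include <linux/percpu-rwsem.h>

    static struct percpu_rw_semaphore demo_blocksize_sem;

    /* Hot path: readers pay almost nothing. */
    static void demo_io_path(void)
    {
            percpu_down_read(&demo_blocksize_sem);
            /* ... issue I/O that relies on a stable block size ... */
            percpu_up_read(&demo_blocksize_sem);
    }

    /* Rare path: the writer waits for all readers to drain. */
    static void demo_set_blocksize(void)
    {
            percpu_down_write(&demo_blocksize_sem);
            /* ... safe to change the block size here ... */
            percpu_up_write(&demo_blocksize_sem);
    }

    /* percpu_init_rwsem(&demo_blocksize_sem) must run once at init time. */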
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide the
    interface and behavior of a timer that executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
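
    The practical upshot of the non-reentrancy change for driver code is
    sketched below: plain flush_work() now gives the guarantee that
    previously required the _sync() variant. The demo_ names are
    illustrative.

    #include <linux/workqueue.h>

    static void demo_fn(struct work_struct *work)
    {
            /* ... do the deferred work ... */
    }

    static DECLARE_WORK(demo_work, demo_fn);

    static void demo_teardown(void)
    {
            /*
             * With all workqueues non-reentrant, flush_work() guarantees
             * that every previously queued instance of demo_work has
             * finished executing by the time it returns; a separate
             * flush_work_sync() call is no longer needed.
             */
            flush_work(&demo_work);
    }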
     

21 Sep, 2012

2 commits

  • A queue newly allocated with blk_alloc_queue_node() has only
    QUEUE_FLAG_BYPASS set. For request-based drivers,
    blk_init_allocated_queue() is called and q->queue_flags is overwritten
    with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
    initial bypass is still in effect.

    In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into
    q->queue_flags instead of overwriting it.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
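
    Schematically, the fix is the difference between the two statements
    below (q being the request_queue set up in blk_init_allocated_queue());
    only the OR form preserves the BYPASS bit set at allocation.

    /* Before: plain assignment wiped out the QUEUE_FLAG_BYPASS bit that
     * blk_alloc_queue_node() set. */
    q->queue_flags = QUEUE_FLAG_DEFAULT;

    /* After: OR the default flags in, so the initial bypass state survives. */
    q->queue_flags |= QUEUE_FLAG_DEFAULT;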
     
  • …_init_allocated_queue()

  • b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
    request_queues start out bypassed on allocation to avoid switching
    bypass mode on and off while a queue is being initialized. Some
    drivers allocate and then destroy a lot of queues without fully
    initializing them, and incurring the bypass latency overhead on each
    of them could add up to significant overhead.

    Unfortunately, blk_init_allocated_queue() is never used by queues of
    bio-based drivers, which means that all bio-based driver queues are in
    bypass mode even after initialization and registration complete
    successfully.

    Due to the limited way request_queues are used by bio drivers, this
    problem is hidden pretty well but it shows up when blk-throttle is
    used in combination with a bio-based driver. Trying to configure
    (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
    indefinitely in blkg_conf_prep() waiting for bypass mode to end.

    This patch moves the initial blk_queue_bypass_end() call from
    blk_init_allocated_queue() to blk_register_queue(), which is called
    for any userland-visible queue regardless of its type.

    I believe this is correct because I don't think there is any block
    driver which needs or wants working elevator and blk-cgroup on a queue
    which isn't visible to userland. If there are such users, we need a
    different solution.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Tejun Heo
     

20 Sep, 2012

3 commits

  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
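
    A minimal sketch of how a caller might use the new helper; the
    prototype is reproduced from memory of this series, and the demo_
    wrapper and its page handling are hypothetical.

    #include <linux/blkdev.h>
    #include <linux/gfp.h>

    /*
     * Replicate the contents of @pattern across @nr_sects sectors starting
     * at @start. Only one logical block travels to the device; the device
     * writes it to every block in the range.
     */
    static int demo_fill_range(struct block_device *bdev, sector_t start,
                               sector_t nr_sects, struct page *pattern)
    {
            return blkdev_issue_write_same(bdev, start, nr_sects,
                                           GFP_NOFS, pattern);
    }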
     
  • - blk_check_merge_flags() verifies that cmd_flags / bi_rw are
    compatible. This function is called for both req-req and req-bio
    merging.

    - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
    to query the maximum sector count for a given request or queue. The
    calls will return the right value from the queue limits given the
    type of command (RW, discard, write same, etc.)

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Remove special-casing of non-rw fs style requests (discard). The nomerge
    flags are consolidated in blk_types.h, and rq_mergeable() and
    bio_mergeable() have been modified to use them.

    bio_is_rw() is used in place of bio_has_data() in a few places. This
    is done to distinguish true reads and writes from other fs type
    requests that carry a payload (e.g. write same).

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

09 Sep, 2012

4 commits

  • By the time this call is reached, blk_queue_congestion_threshold()
    has already been called from blk_queue_make_request().
    Because the code is duplicated, it has been removed.

    Signed-off-by: Jaehoon Chung
    Signed-off-by: Kyungmin Park
    Signed-off-by: Jens Axboe

    Jaehoon Chung
     
  • Previously, there was bio_clone() but it only allocated from the fs bio
    set; as a result various users were open coding it and using
    __bio_clone().

    This changes bio_clone() to become bio_clone_bioset(), and then we add
    bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
    the functionality the last patch added.

    This will also help in a later patch changing how bio cloning works.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: Boaz Harrosh
    CC: Jeff Garzik
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Kent Overstreet
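
    A sketch of the intended use from a stacking driver's point of view:
    clone from a private bio_set so the clone cannot contend with the
    shared fs_bio_set. The demo_ names are illustrative; bio_clone_bioset()
    and the bioset_create() assumed to have run at init time are the real
    interfaces.

    #include <linux/bio.h>
    #include <linux/gfp.h>

    static struct bio *demo_clone(struct bio *src, struct bio_set *demo_bs)
    {
            struct bio *clone;

            clone = bio_clone_bioset(src, GFP_NOIO, demo_bs);
            if (!clone)
                    return NULL;

            /* Remember the parent so the endio handler can complete it. */
            clone->bi_private = src;
            return clone;
    }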
     
  • Now that we've got generic code for freeing bios allocated from bio
    pools, this isn't needed anymore.

    This patch also makes bio_free() static, since without bi_destructor
    there should be no need for it to be called anywhere else.

    bio_free() is now only called from bio_put, so we can refactor those a
    bit - move some code from bio_put() to bio_free() and kill the redundant
    bio->bi_next = NULL.

    v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
    v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
    v7: No #define BIO_KMALLOC_POOL anymore

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Now that bios keep track of where they were allocated from,
    bio_integrity_alloc_bioset() becomes redundant.

    Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
    related functions and make them use bio->bi_pool.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Martin K. Petersen
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

31 Aug, 2012

1 commit

  • When performing a cable pull test w/ active stress I/O using fio over
    a dual port Intel 82599 FCoE CNA, w/ 256 LUNs on one port and about 32
    LUNs on the other, the system becomes unusable because scsi-ml is busy
    printing error messages for all the failing commands. I don't believe
    this problem is specific to FCoE, and since these commands are failing
    anyway due to the link being down (DID_NO_CONNECT), just rate-limit
    the messages here to solve the issue.

    v2->v1: use __ratelimit(), as Tomas Henzl mentioned, as the proper way
    to rate-limit per function. However, in this case, the failed i/o gets to
    blk_end_request_err() and then blk_update_request(), which also has to
    be rate-limited, as added in the v2 of this patch.

    v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip.

    Signed-off-by: Yi Zou
    Cc: www.Open-FCoE.org
    Cc: Tomas Henzl
    Cc:
    Signed-off-by: Jens Axboe

    Yi Zou
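
    The per-call-site pattern being referred to is sketched below; the
    demo_ function is illustrative, the rate-limit machinery is the
    standard <linux/ratelimit.h> one.

    #include <linux/kernel.h>
    #include <linux/ratelimit.h>

    static void demo_report_io_error(int error, sector_t sector)
    {
            /* One rate-limit state per message site. */
            static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
                                          DEFAULT_RATELIMIT_BURST);

            if (__ratelimit(&rs))
                    pr_err("demo: I/O error %d at sector %llu\n",
                           error, (unsigned long long)sector);
    }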
     

22 Aug, 2012

2 commits

  • Now that cancel_delayed_work() can be safely called from IRQ handlers,
    there's no reason to use __cancel_delayed_work(). Use
    cancel_delayed_work() instead of __cancel_delayed_work() and mark the
    latter deprecated.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jiri Kosina
    Cc: Roland Dreier
    Cc: Tomi Valkeinen

    Tejun Heo
     
  • Now that mod_delayed_work() is safe to call from IRQ handlers,
    __cancel_delayed_work() followed by queue_delayed_work() can be
    replaced with mod_delayed_work().

    Most conversions are straight-forward except for the following.

    * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
    elaborate dance around its delayed_work. Collapse it such that
    linkwatch_work is queued for immediate execution if LW_URGENT and the
    existing timer is kept otherwise.

    Signed-off-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Tomi Valkeinen

    Tejun Heo
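
    The before/after shape of a typical conversion, sketched with an
    illustrative delayed work item:

    #include <linux/workqueue.h>

    static void demo_fn(struct work_struct *work)
    {
            /* ... */
    }

    static DECLARE_DELAYED_WORK(demo_dwork, demo_fn);

    /* Before: cancel + queue, with the ordering subtleties that implies. */
    static void demo_rearm_old(unsigned long delay)
    {
            __cancel_delayed_work(&demo_dwork);
            queue_delayed_work(system_wq, &demo_dwork, delay);
    }

    /* After: one call that (re)arms the timer whether or not it is pending. */
    static void demo_rearm_new(unsigned long delay)
    {
            mod_delayed_work(system_wq, &demo_dwork, delay);
    }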
     

31 Jul, 2012

3 commits

  • This will allow md/raid to know why the unplug was called,
    and to act accordingly - if !from_schedule it
    is safe to perform tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • MD raid1 prepares to dispatch requests in its unplug callback. If make_request
    in the low level queue also uses an unplug callback to dispatch requests, the
    low level queue's unplug callback will not be called. Rechecking the callback
    list helps this case.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Both md and umem have similar code for getting notified of a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

27 Jun, 2012

1 commit

  • Currently, request_queue has one request_list to allocate requests
    from regardless of blkcg of the IO being issued. When the unified
    request pool is used up, cfq proportional IO limits become meaningless
    - whoever grabs the next request being freed wins the race regardless
    of the configured weights.

    This can be easily demonstrated by creating a blkio cgroup w/ very low
    weight, putting a program which can issue a lot of random direct IOs
    there and running a sequential IO from a different cgroup. As soon as the
    request pool is used up, the sequential IO bandwidth crashes.

    This patch implements per-blkg request_list. Each blkg has its own
    request_list and any IO allocates its request from the matching blkg
    making blkcgs completely isolated in terms of request allocation.

    * Root blkcg uses the request_list embedded in each request_queue,
    which was renamed to @q->root_rl from @q->rq. While making blkcg rl
    handling a bit hairier, this enables avoiding most overhead for root
    blkcg.

    * Queue fullness is properly per request_list but bdi isn't blkcg
    aware yet, so congestion state currently just follows the root
    blkcg. As writeback isn't aware of blkcg yet, this works okay for
    async congestion but readahead may get the wrong signals. It's
    better than blkcg completely collapsing with shared request_list but
    needs to be improved with future changes.

    * After this change, each block cgroup gets a full request pool making
    resource consumption of each cgroup higher. This makes allowing
    non-root users to create cgroups less desirable; however, note that
    allowing non-root users to directly manage cgroups is already
    severely broken regardless of this patch - each block cgroup
    consumes kernel memory and skews IO weight (IO weights are not
    hierarchical).

    v2: queue-sysfs.txt updated and patch description updated as suggested
    by Vivek.

    v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure. Fix it
    by falling back to root_rl on blkg lookup failures. This problem
    was spotted by Rakesh Iyer.

    v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue". blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Jun, 2012

5 commits

  • Request allocation is about to be made per-blkg meaning that there'll
    be multiple request lists.

    * Make queue full state per request_list. blk_*queue_full() functions
    are renamed to blk_*rl_full() and take @rl instead of @q.

    * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
    instead of @q. Also add @gfp_mask parameter.

    * Add blk_exit_rl() instead of destroying rl directly from
    blk_release_queue().

    * Add request_list->q and make request alloc/free functions -
    blk_free_request(), [__]freed_request(), __get_request() - take @rl
    instead of @q.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
    move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
    to use q->nr_rqs[] instead of q->rq.count[].

    These counters separate queue-wide request statistics from the
    request list and allow implementation of per-queue request allocation.

    While at it, properly indent fields of struct request_list.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The block layer does very lazy allocation of ioc. It waits until the
    moment the ioc is absolutely necessary; unfortunately, that time could
    be inside the queue lock, and __get_request() performs an unlock - try
    alloc - retry dance.

    Just allocate it up-front on entry to block layer. We're not saving
    the rain forest by deferring it to the last possible moment and
    complicating things unnecessarily.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, there are two request allocation functions - get_request()
    and get_request_wait(). The former tries to allocate a request once;
    the latter wraps the former and keeps retrying until allocation succeeds.

    The combination of the two functions delivers fallible non-wait
    allocation, fallible wait allocation and unfailing wait allocation. However,
    given that forward progress is guaranteed, fallible wait allocation
    isn't all that useful and in fact nobody uses it.

    This patch simplifies the interface as follows.

    * get_request() is renamed to __get_request() and is only used by the
    wrapper function.

    * get_request_wait() is renamed to get_request(). It now takes
    @gfp_mask and retries iff it contains %__GFP_WAIT.

    This patch doesn't introduce any functional change and is to prepare
    for further updates to request allocation path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
    blk_init_free_list(), is about to be updated to use other allocation
    flags - add @gfp_mask argument to the function.

    Signed-off-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2012

2 commits

  • Commit 777eb1bf15b8532c396821774bf6451e563438f5 disconnects the externally
    supplied queue_lock before blk_drain_queue(). Switching the lock would
    introduce a lock imbalance because threads which have taken the external
    lock might unlock the internal lock during the queue drain. This patch
    mitigates that by disconnecting the lock after the queue drain, since
    draining makes a lot of request_queue users go away.

    However, please note that this patch only makes the problem less likely
    to happen. Anyone who still holds a ref might try to issue a new request
    on a dead queue after blk_cleanup_queue() finishes draining, and the lock
    imbalance might still happen in that case.

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.4.0+ #288 Not tainted
    -------------------------------------
    fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
    [] blk_queue_bio+0x2a2/0x380
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by fio/17706:
    #0: (&(&vblk->lock)->rlock){......}, at: []
    get_request_wait+0x19a/0x250

    stack backtrace:
    Pid: 17706, comm: fio Not tainted 3.4.0+ #288
    Call Trace:
    [] ? blk_queue_bio+0x2a2/0x380
    [] print_unlock_inbalance_bug+0xf9/0x100
    [] lock_release_non_nested+0x1df/0x330
    [] ? dio_bio_end_aio+0x34/0xc0
    [] ? bio_check_pages_dirty+0x85/0xe0
    [] ? dio_bio_end_aio+0xb1/0xc0
    [] ? blk_queue_bio+0x2a2/0x380
    [] ? blk_queue_bio+0x2a2/0x380
    [] lock_release+0xd9/0x250
    [] _raw_spin_unlock_irq+0x23/0x40
    [] blk_queue_bio+0x2a2/0x380
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x76/0xf0
    [] ? set_page_dirty_lock+0x3c/0x60
    [] ? bio_set_pages_dirty+0x51/0x70
    [] do_blockdev_direct_IO+0xbf8/0xee0
    [] ? blkdev_get_block+0x80/0x80
    [] __blockdev_direct_IO+0x55/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] blkdev_direct_IO+0x57/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] generic_file_aio_read+0x70e/0x760
    [] ? __lock_acquire+0x215/0x5a0
    [] ? aio_run_iocb+0x54/0x1a0
    [] ? grab_cache_page_nowait+0xc0/0xc0
    [] aio_rw_vect_retry+0x7c/0x1e0
    [] ? aio_fsync+0x30/0x30
    [] aio_run_iocb+0x66/0x1a0
    [] do_io_submit+0x6f0/0xb80
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] sys_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Changes since v2: Update commit log to explain how the code is still
    broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.

    Acked-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • After hot-unplugging a stressed disk, I found that rl->wait[] is not
    empty while rl->count[] is empty and there are threads still sleeping
    on get_request after the queue cleanup. With simple debug code, I found
    there are exactly nr_sleep - nr_wakeup threads in D state, so there
    are missed wakeups.

    $ dmesg | grep nr_sleep
    [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
    $ vmstat 1
    1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0

    To quote Tejun:

    Ah, okay, freed_request() wakes up single waiter with the assumption
    that after the wakeup there will at least be one successful allocation
    which in turn will continue the wakeup chain until the wait list is
    empty - ie. waiter wakeup is dependent on successful request
    allocation happening after each wakeup. With queue marked dead, any
    woken up waiter fails the allocation path, so the wakeup chaining is
    lost and we're left with hung waiters. What we need is wake_up_all()
    after drain completion.

    This patch fixes the missed wakeups by waking up all the threads which
    are sleeping on the wait queue after the queue drain.

    Changes in v2: Drop waitqueue_active() optimization

    Acked-by: Tejun Heo
    Signed-off-by: Asias He

    I fixed a bug where stacked devices would oops on calling
    blk_drain_queue(), since ->rq.wait[] does not get initialized unless
    it's a full queue setup.

    Signed-off-by: Jens Axboe

    Asias He
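
    The gist of the fix as a sketch; field names follow the 3.5-era struct
    request_list with its two sync/async waitqueues, so treat the exact
    layout as approximate.

    #include <linux/blkdev.h>
    #include <linux/kernel.h>
    #include <linux/wait.h>

    /* Call after draining: wake every thread still parked in get_request(). */
    static void demo_wake_all_rq_waiters(struct request_queue *q)
    {
            int i;

            /* Only fully set up request-based queues initialize these. */
            if (!q->request_fn)
                    return;

            for (i = 0; i < ARRAY_SIZE(q->rq.wait); i++)
                    wake_up_all(&q->rq.wait[i]);
    }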
     

20 Apr, 2012

1 commit

  • Request allocation is mempool backed to guarantee forward progress
    under memory pressure; unfortunately, this property got broken while
    adding elvpriv data. Failures during elvpriv allocation, including
    ioc and icq creation failures, currently make get_request() fail as
    whole. There's no forward progress guarantee for these allocations -
    they may fail indefinitely under memory pressure stalling IO and
    deadlocking the system.

    This patch updates get_request() such that elvpriv allocation failure
    doesn't make the whole function fail. If elvpriv allocation fails,
    the allocation is degraded into !ELVPRIV. This will force the request
    to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
    failures should be rare (nothing is per-request) and anything is
    better than deadlocking.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo