Eric Lee / smarc-fsl-linux-kernel

11 Jan, 2013

1 commit

242d98f07 block,elevator: use new hashtable implementation ... Browse Code »

Switch elevator to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the elevator.

This also removes the dymanic allocation of the hash table. The size of the table is
constant so there's no point in paying the price of an extra dereference when accessing
it.

This patch depends on d9b482c ("hashtable: introduce a small and naive
hashtable") which was merged in v3.6.

Signed-off-by: Sasha Levin
Signed-off-by: Jens Axboe

Sasha Levin
2013-01-11 21:43:13 +0800

06 Dec, 2012

2 commits

c246e80d8 block: Avoid that request_fn is invoked on a dead queue ... Browse Code »

A block driver may start cleaning up resources needed by its
request_fn as soon as blk_cleanup_queue() finished, so request_fn
must not be invoked after draining finished. This is important
when blk_run_queue() is invoked without any requests in progress.
As an example, if blk_drain_queue() and scsi_run_queue() run in
parallel, blk_drain_queue() may have finished all requests after
scsi_run_queue() has taken a SCSI device off the starved list but
before that last function has had a chance to run the queue.

Signed-off-by: Bart Van Assche
Cc: James Bottomley
Cc: Mike Christie
Cc: Chanho Min
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Bart Van Assche
2012-12-06 21:32:01 +0800
3f3299d5c block: Rename queue dead flag ... Browse Code »

QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
stop. After this flag has been set queue draining starts. However,
during the queue draining phase it is still safe to invoke the
queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
flag.

This patch has been generated by running the following command
over the kernel source tree:

git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
-e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
include/linux/blkdev.h; \
sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
-e 's/Dead queue/A dying queue/' block/blk-core.c

Signed-off-by: Bart Van Assche
Acked-by: Tejun Heo
Cc: James Bottomley
Cc: Mike Christie
Cc: Jens Axboe
Cc: Chanho Min
Signed-off-by: Jens Axboe

Bart Van Assche
2012-12-06 21:30:58 +0800

20 Sep, 2012

1 commit

e2a60da74 block: Clean up special command handling logic ... Browse Code »

Remove special-casing of non-rw fs style requests (discard). The nomerge
flags are consolidated in blk_types.h, and rq_mergeable() and
bio_mergeable() have been modified to use them.

bio_is_rw() is used in place of bio_has_data() a few places. This is
done to to distinguish true reads and writes from other fs type requests
that carry a payload (e.g. write same).

Signed-off-by: Martin K. Petersen
Acked-by: Mike Snitzer
Signed-off-by: Jens Axboe

Martin K. Petersen
2012-09-20 20:31:38 +0800

01 Aug, 2012

1 commit

80799fbb7 block: remove dead func declaration ... Browse Code »

__generic_unplug_device() function is removed with commit
7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to
remove the declaration at meantime. Here remove it.

Signed-off-by: Yuanhan Liu
Signed-off-by: Jens Axboe

Yuanhan Liu
2012-08-01 18:25:54 +0800

25 Jun, 2012

1 commit

5b788ce3e block: prepare for multiple request_lists ... Browse Code »

Request allocation is about to be made per-blkg meaning that there'll
be multiple request lists.

* Make queue full state per request_list. blk_*queue_full() functions
are renamed to blk_*rl_full() and takes @rl instead of @q.

* Rename blk_init_free_list() to blk_init_rl() and make it take @rl
instead of @q. Also add @gfp_mask parameter.

* Add blk_exit_rl() instead of destroying rl directly from
blk_release_queue().

* Add request_list->q and make request alloc/free functions -
blk_free_request(), [__]freed_request(), __get_request() - take @rl
instead of @q.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo
Acked-by: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2012-06-25 17:53:52 +0800

02 Apr, 2012

1 commit

959d851ca Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-merged ... Browse Code »

cgroup/for-3.5 contains the following changes which blk-cgroup needs
to proceed with the on-going cleanup.

* Dynamic addition and removal of cftypes to make config/stat file
handling modular for policies.

* cgroup removal update to not wait for css references to drain to fix
blkcg removal hang caused by cfq caching cfqgs.

Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
following conflicts in block/blk-cgroup.c.

* 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
removal. Resolved by removing @subsys from all subsys methods.

* 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
controllers" conflicts with ->pre_destroy() and ->attach() updates
and removal of modular config. Resolved by dropping forward
declarations of the methods and applying updates to the relocated
blkio_subsys.

* 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
cftype interface" builds upon the previous item. Resolved by adding
->base_cftypes to the relocated blkio_subsys.

Signed-off-by: Tejun Heo

Tejun Heo
2012-04-02 03:55:00 +0800

07 Mar, 2012

3 commits

24acfc34f block: interface update for ioc/icq creation functions ... Browse Code »

Make the following interface updates to prepare for future ioc related
changes.

* create_io_context() returning ioc only works for %current because it
doesn't increment ref on the ioc. Drop @task parameter from it and
always assume %current.

* Make create_io_context_slowpath() return 0 or -errno and rename it
to create_task_io_context().

* Make ioc_create_icq() take @ioc as parameter instead of assuming
that of %current. The caller, get_request(), is updated to create
ioc explicitly and then pass it into ioc_create_icq().

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2012-03-07 04:27:24 +0800
5efd61135 blkcg: add blkcg_{init|drain|exit}_queue() ... Browse Code »

Currently block core calls directly into blk-throttle for init, drain
and exit. This patch adds blkcg_{init|drain|exit}_queue() which wraps
the blk-throttle functions. This is to give more control and
visiblity to blkcg core layer for proper layering. Further patches
will add logic common to blkcg policies to the functions.

While at it, collapse blk_throtl_release() into blk_throtl_exit().
There's no reason to keep them separate.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2012-03-07 04:27:23 +0800
d732580b4 block: implement blk_queue_bypass_start/end() ... Browse Code »

Rename and extend elv_queisce_start/end() to
blk_queue_bypass_start/end() which are exported and supports nesting
via @q->bypass_depth. Also add blk_queue_bypass() to test bypass
state.

This will be further extended and used for blkio_group management.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2012-03-07 04:27:21 +0800

01 Mar, 2012

1 commit

7e4d96099 Merge branch 'linus' into sched/core ... Browse Code »

Merge reason: we'll queue up dependent patches.

Signed-off-by: Ingo Molnar

Ingo Molnar
2012-03-01 17:26:43 +0800

08 Feb, 2012

1 commit

050c8ea80 block: separate out blk_rq_merge_ok() and blk_try_merge() from elevator functions ... Browse Code »

blk_rq_merge_ok() is the elevator-neutral part of merge eligibility
test. blk_try_merge() determines merge direction and expects the
caller to have tested elv_rq_merge_ok() previously.

elv_rq_merge_ok() now wraps blk_rq_merge_ok() and then calls
elv_iosched_allow_merge(). elv_try_merge() is removed and the two
callers are updated to call elv_rq_merge_ok() explicitly followed by
blk_try_merge(). While at it, make rq_merge_ok() functions return
bool.

This is to prepare for plug merge update and doesn't introduce any
behavior change.

This is based on Jens' patch to skip elevator_allow_merge_fn() from
plug merge.

Signed-off-by: Tejun Heo
LKML-Reference:
Original-patch-by: Jens Axboe
Signed-off-by: Jens Axboe

Tejun Heo
2012-02-08 16:19:38 +0800

27 Jan, 2012

1 commit

39be35012 sched, block: Unify cache detection ... Browse Code »

The block layer has some code trying to determine if two CPUs share a
cache, the scheduler has a similar function. Expose the function used
by the scheduler and make the block layer use it, thereby removing the
block layers usage of CONFIG_SCHED* and topology bits.

Signed-off-by: Peter Zijlstra
Acked-by: Jens Axboe
Link: http://lkml.kernel.org/r/1327579450.2446.95.camel@twins

Peter Zijlstra
2012-01-27 20:28:48 +0800

14 Dec, 2011

9 commits

f1f8cc946 block, cfq: move icq creation and rq->elv.icq association to block core ... Browse Code »

Now block layer knows everything necessary to create and associate
icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
get_request() such that, if elevator_type->icq_size is set, requests
are automatically associated with their matching icq's before
elv_set_request(). io_context reference is also managed by block core
on request alloc/free.

* Only ioprio/cgroup changed handling remains from cfq_get_cic().
Collapsed into cfq_set_request().

* This removes queue kicking on icq allocation failure (for now). As
icq allocation failure is rare and the only effect of queue kicking
achieved was possibily accelerating queue processing, this change
shouldn't be noticeable.

There is a larger underlying problem. Unlike request allocation,
icq allocation is not guaranteed to succeed eventually after
retries. The number of icq is unbound and thus mempool can't be the
solution either. This effectively adds allocation dependency on
memory free path and thus possibility of deadlock.

This usually wouldn't happen because icq allocation is not a hot
path and, even when the condition triggers, it's highly unlikely
that none of the writeback workers already has icq.

However, this is still possible especially if elevator is being
switched under high memory pressure, so we better get it fixed.
Probably the only solution is just bypassing elevator and appending
to dispatch queue on any elevator allocation failure.

* Comment added to explain how icq's are managed and synchronized.

This completes cleanup of io_context interface.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:42 +0800
7e5a87944 block, cfq: move io_cq exit/release to blk-ioc.c ... Browse Code »

With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
blk-ioc too. The odd ->io_cq->exit/release() callbacks are replaced
with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
and q, and freeing automatically handled by blk-ioc. The elevator
operation only need to perform exit operation specific to the elevator
- in cfq's case, exiting the cfqq's.

Also, clearing of io_cq's on q detach is moved to block core and
automatically performed on elevator switch and q release.

Because the q io_cq points to might be freed before RCU callback for
the io_cq runs, blk-ioc code should remember to which cache the io_cq
needs to be freed when the io_cq is released. New field
io_cq->__rcu_icq_cache is added for this purpose. As both the new
field and rcu_head are used only after io_cq is released and the
q/ioc_node fields aren't, they are put into unions.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:42 +0800
47fdd4ca9 block, cfq: move io_cq lookup to blk-ioc.c ... Browse Code »

Now that all io_cq related data structures are in block core layer,
io_cq lookup can be moved from cfq-iosched.c to blk-ioc.c.

Lookup logic from cfq_cic_lookup() is moved to ioc_lookup_icq() with
parameter return type changes (cfqd -> request_queue, cfq_io_cq ->
io_cq) and cfq_cic_lookup() becomes thin wrapper around
cfq_cic_lookup().

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:42 +0800
22f746e23 block: remove elevator_queue->ops ... Browse Code »

elevator_queue->ops points to the same ops struct ->elevator_type.ops
is pointing to. The only effect of caching it in elevator_queue is
shorter notation - it doesn't save any indirect derefence.

Relocate elevator_type->list which used only during module init/exit
to the end of the structure, rename elevator_queue->elevator_type to
->type, and replace elevator_queue->ops with elevator_queue->type.ops.

This doesn't introduce any functional difference.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:41 +0800
f2dbd76a0 block, cfq: replace current_io_context() with create_io_context() ... Browse Code »

When called under queue_lock, current_io_context() triggers lockdep
warning if it hits allocation path. This is because io_context
installation is protected by task_lock which is not IRQ safe, so it
triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
deadlock warning.

Given the restriction, accessor + creator rolled into one doesn't work
too well. Drop current_io_context() and let the users access
task->io_context directly inside queue_lock combined with explicit
creation using create_io_context().

Future ioc updates will further consolidate ioc access and the create
interface will be unexported.

While at it, relocate ioc internal interface declarations in blk.h and
add section comments before and after.

This patch does not introduce functional change.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:40 +0800
09ac46c42 block: misc updates to blk_get_queue() ... Browse Code »

* blk_get_queue() is peculiar in that it returns 0 on success and 1 on
failure instead of 0 / -errno or boolean. Update it such that it
returns %true on success and %false on failure.

* Make sure the caller checks for the return value.

* Separate out __blk_get_queue() which doesn't check whether @q is
dead and put it in blk.h. This will be used later.

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:38 +0800
6e736be7f block: make ioc get/put interface more conventional and fix race on alloction ... Browse Code »

Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio(). The former is
always called from local task while the latter can be called from
different task. The synchornization between them are peculiar and
dubious.

* current_io_context() doesn't grab task_lock() and assumes that if it
saw %NULL ->io_context, it would stay that way until allocation and
assignment is complete. It has smp_wmb() between alloc/init and
assignment.

* set_task_ioprio() grabs task_lock() for assignment and does
smp_read_barrier_depends() between "ioc = task->io_context" and "if
(ioc)". Unfortunately, this doesn't achieve anything - the latter
is not a dependent load of the former. ie, if ioc itself were being
dereferenced "ioc->xxx", it would mean something (not sure what tho)
but as the code currently stands, the dependent read barrier is
noop.

As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.

Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.

ioc get/put doesn't have any reason to be complex. The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive. All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.

This patch updates ioc get/put so that it becomes more conventional.

* alloc_io_context() is replaced with get_task_io_context(). This is
the only interface which can acquire access to ioc of another task.
On return, the caller has an explicit reference to the object which
should be put using put_io_context() afterwards.

* The functionality of current_io_context() remains the same but when
creating a new ioc, it shares the code path with
get_task_io_context() and always goes through task_lock().

* get_io_context() now means incrementing ref on an ioc which the
caller already has access to (be that an explicit refcnt or implicit
%current one).

* PF_EXITING inhibits creation of new io_context and once
exit_io_context() is finished, it's guaranteed that both ioc
acquisition functions return %NULL.

* All users are updated. Most are trivial but
smp_read_barrier_depends() removal from cfq_get_io_context() needs a
bit of explanation. I suppose the original intention was to ensure
ioc->ioprio is visible when set_task_ioprio() allocates new
io_context and installs it; however, this wouldn't have worked
because set_task_ioprio() doesn't have wmb between init and install.
There are other problems with this which will be fixed in another
patch.

* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
specification.

-v2: Vivek spotted contamination from debug patch. Removed.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:38 +0800
a73f730d0 block, cfq: move cfqd->cic_index to q->id ... Browse Code »

cfq allocates per-queue id using ida and uses it to index cic radix
tree from io_context. Move it to q->id and allocate on queue init and
free on queue release. This simplifies cfq a bit and will allow for
further improvements of io context life-cycle management.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:37 +0800
34f6055c8 block: add blk_queue_dead() ... Browse Code »

There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
macro and use it.

This patch doesn't introduce any functional difference.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-12-14 07:33:37 +0800

19 Oct, 2011

4 commits

c9a929dde block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown ... Browse Code »

request_queue is refcounted but actually depdends on lifetime
management from the queue owner - on blk_cleanup_queue(), block layer
expects that there's no request passing through request_queue and no
new one will.

This is fundamentally broken. The queue owner (e.g. SCSI layer)
doesn't have a way to know whether there are other active users before
calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
guarantee that the queue is and would stay valid while it's holding a
reference.

With delay added in blk_queue_bio() before queue_lock is grabbed, the
following oops can be easily triggered when a device is removed with
in-flight IOs.

sd 0:0:1:0: [sdb] Stopping disk
ata1.01: disabled
general protection fault: 0000 [#1] PREEMPT SMP
CPU 2
Modules linked in:

Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
RIP: 0010:[] [] elv_rqhash_find+0x61/0x100
...
Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
...
Call Trace:
[] elv_merge+0x84/0xe0
[] blk_queue_bio+0xf4/0x400
[] generic_make_request+0xca/0x100
[] submit_bio+0x74/0x100
[] dio_bio_submit+0xbc/0xc0
[] __blockdev_direct_IO+0x92e/0xb40
[] blkdev_direct_IO+0x57/0x60
[] generic_file_aio_read+0x6d5/0x760
[] do_sync_read+0xda/0x120
[] vfs_read+0xc5/0x180
[] sys_pread64+0x9a/0xb0
[] system_call_fastpath+0x16/0x1b

This happens because blk_queue_cleanup() destroys the queue and
elevator whether IOs are in progress or not and DEAD tests are
sprinkled in the request processing path without proper
synchronization.

Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
is shutdown whether it has requests in it or not. Depending on
timing, it either oopses or throttled bios are lost putting tasks
which are waiting for bio completion into eternal D state.

The way it should work is having the usual clear distinction between
shutdown and release. Shutdown drains all currently pending requests,
marks the queue dead, and performs partial teardown of the now
unnecessary part of the queue. Even after shutdown is complete,
reference holders are still allowed to issue requests to the queue
although they will be immmediately failed. The rest of teardown
happens on release.

This patch makes the following changes to make blk_queue_cleanup()
behave as proper shutdown.

* QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
queue_lock.

* Unsynchronized DEAD check in generic_make_request_checks() removed.
This couldn't make any meaningful difference as the queue could die
after the check.

* blk_drain_queue() updated such that it can drain all requests and is
now called during cleanup.

* blk_throtl updated such that it checks DEAD on grabbing queue_lock,
drains all throttled bios during cleanup and free td when queue is
released.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2011-10-19 20:42:16 +0800
bc16a4f93 block: reorganize throtl_get_tg() and blk_throtl_bio() ... Browse Code »

blk_throtl_bio() and throtl_get_tg() have rather unusual interface.

* throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
and drops queue_lock in the latter case. Different locking context
depending on return value is error-prone and DEAD state is scheduled
to be protected by queue_lock anyway. Move DEAD check inside
queue_lock and return valid tg or NULL.

* blk_throtl_bio() indicates return status both with its return value
and in/out param **@bio. The former is used to indicate whether
queue is found to be dead during throtl processing. The latter
whether the bio is throttled.

There's no point in returning DEAD check result from
blk_throtl_bio(). The queue can die after blk_throtl_bio() is
finished but before make_request_fn() grabs queue lock.

Make it take *@bio instead and return boolean result indicating
whether the request is throttled or not.

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2011-10-19 20:33:01 +0800
e3c78ca52 block: reorganize queue draining ... Browse Code »

Reorganize queue draining related code in preparation of queue exit
changes.

* Factor out actual draining from elv_quiesce_start() to
blk_drain_queue().

* Make elv_quiesce_start/end() responsible for their own locking.

* Replace open-coded ELVSWITCH clearing in elevator_switch() with
elv_quiesce_end().

This patch doesn't cause any visible functional difference.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-10-19 20:32:38 +0800
bc9fcbf9c block: move blk_throtl prototypes to block/blk.h ... Browse Code »

blk_throtl interface is block internal and there's no reason to have
them in linux/blkdev.h. Move them to block/blk.h.

This patch doesn't introduce any functional change.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2011-10-19 20:31:18 +0800

16 Aug, 2011

1 commit

4853abaae block: fix flush machinery for stacking drivers with differring flush flags ... Browse Code »

Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
FLUSH/FUA to support merge, introduced a performance regression when
running any sort of fsyncing workload using dm-multipath and certain
storage (in our case, an HP EVA). The test I ran was fs_mark, and it
dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
that dm-multipath always advertised flush+fua support, and passed
commands on down the stack, where those flags used to get stripped off.
The above commit changed that behavior:

static inline struct request *__elv_next_request(struct request_queue *q)
{
struct request *rq;

while (1) {
- while (!list_empty(&q->queue_head)) {
+ if (!list_empty(&q->queue_head)) {
rq = list_entry_rq(q->queue_head.next);
- if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
- (rq->cmd_flags & REQ_FLUSH_SEQ))
- return rq;
- rq = blk_do_flush(q, rq);
- if (rq)
- return rq;
+ return rq;
}

Note that previously, a command would come in here, have
REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

struct request *blk_do_flush(struct request_queue *q, struct request *rq)
{
unsigned int fflags = q->flush_flags; /* may change, cache it */
bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
bool do_postflush = has_flush && !has_fua && (rq->cmd_flags &
REQ_FUA);
unsigned skip = 0;
...
if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
rq->cmd_flags &= ~REQ_FLUSH;
if (!has_fua)
rq->cmd_flags &= ~REQ_FUA;
return rq;
}

So, the flush machinery was bypassed in such cases (q->flush_flags == 0
&& rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).

Now, however, we don't get into the flush machinery at all. Instead,
__elv_next_request just hands a request with flush and fua bits set to
the scsi_request_fn, even if the underlying request_queue does not
support flush or fua.

The agreed upon approach is to fix the flush machinery to allow
stacking. While this isn't used in practice (since there is only one
request-based dm target, and that target will now reflect the flush
flags of the underlying device), it does future-proof the solution, and
make it function as designed.

In order to make this work, I had to add a field to the struct request,
inside the flush structure (to store the original req->end_io). Shaohua
had suggested overloading the union with rb_node and completion_data,
but the completion data is used by device mapper and can also be used by
other drivers. So, I didn't see a way around the additional field.

I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
the lost performance. Comments and other testers, as always, are
appreciated.

Cheers,
Jeff

Signed-off-by: Jeff Moyer
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Jeff Moyer
2011-08-16 03:37:25 +0800

21 May, 2011

2 commits

0eb8e8857 Merge branch 'for-linus' into for-2.6.40/core ... Browse Code »

This patch merges in a fix that missed 2.6.39 final.

Conflicts:
block/blk.h

Jens Axboe
2011-05-21 02:36:16 +0800
698567f3f Merge commit 'v2.6.39' into for-2.6.40/core ... Browse Code »

Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
had churn in the core area that makes it difficult to handle
patches for eg cfq or blk-throttle. Instead of requiring that they
be based in older versions with bugs that have been fixed later
in the rc cycle, merge in 2.6.39 final.

Also fixes up conflicts in the below files.

Conflicts:
drivers/block/paride/pcd.c
drivers/cdrom/viocd.c
drivers/ide/ide-cd.c

Signed-off-by: Jens Axboe

Jens Axboe
2011-05-21 02:33:15 +0800

19 May, 2011

1 commit

0a58e077e block: add proper state guards to __elv_next_request ... Browse Code »

blk_cleanup_queue() calls elevator_exit() and after this, we can't
touch the elevator without oopsing. __elv_next_request() must check
for this state because in the refcounted queue model, we can still
call it after blk_cleanup_queue() has been called.

This was reported as causing an oops attributable to scsi.

Signed-off-by: James Bottomley
Cc: stable@kernel.org
Signed-off-by: Jens Axboe

James Bottomley
2011-05-19 01:30:32 +0800

07 May, 2011

1 commit

3ac0cc450 block: hold queue if flush is running for non-queueable flush drive ... Browse Code »

In some drives, flush requests are non-queueable. When flush request is
running, normal read/write requests can't run. If block layer dispatches
such request, driver can't handle it and requeue it. Tejun suggested we
can hold the queue when flush is running. This can avoid unnecessary
requeue. Also this can improve performance. For example, we have
request flush1, write1, flush 2. flush1 is dispatched, then queue is
hold, write1 isn't inserted to queue. After flush1 is finished, flush2
will be dispatched. Since disk cache is already clean, flush2 will be
finished very soon, so looks like flush2 is folded to flush1.

In my test, the queue holding completely solves a regression introduced by
commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

block: make the flush insertion use the tail of the dispatch list

It's not a preempt type request, in fact we have to insert it
behind requests that do specify INSERT_FRONT.

which causes about 20% regression running a sysbench fileio
workload.

Stable: 2.6.39 only

Cc: stable@kernel.org
Signed-off-by: Shaohua Li
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

shaohua.li@intel.com
2011-05-07 01:36:25 +0800

19 Apr, 2011

1 commit

c21e6beba block: get rid of QUEUE_FLAG_REENTER ... Browse Code »

We are currently using this flag to check whether it's safe
to call into ->request_fn(). If it is set, we punt to kblockd.
But we get a lot of false positives and excessive punts to
kblockd, which hurts performance.

The only real abuser of this infrastructure is SCSI. So export
the async queue run and convert SCSI over to use that. There's
room for improvement in that SCSI need not always use the async
call, but this fixes our performance issue and they can fix that
up in due time.

Signed-off-by: Jens Axboe

Jens Axboe
2011-04-19 19:32:46 +0800

18 Apr, 2011

1 commit

24ecfbe27 block: add blk_run_queue_async ... Browse Code »

Instead of overloading __blk_run_queue to force an offload to kblockd
add a new blk_run_queue_async helper to do it explicitly. I've kept
the blk_queue_stopped check for now, but I suspect it's not needed
as the check we do when the workqueue items runs should be enough.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2011-04-18 17:41:33 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings ... Browse Code »

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

21 Mar, 2011

1 commit

5e84ea3a9 block: attempt to merge with existing requests on plug flush ... Browse Code »

One of the disadvantages of on-stack plugging is that we potentially
lose out on merging since all pending IO isn't always visible to
everybody. When we flush the on-stack plugs, right now we don't do
any checks to see if potential merge candidates could be utilized.

Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
It works just ELEVATOR_INSERT_SORT, but first checks whether we can
merge with an existing request before doing the insertion (if we fail
merging).

This fixes a regression with multiple processes issuing IO that
can be merged.

Thanks to Shaohua Li for testing and fixing
an accounting bug.

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-21 17:14:27 +0800

10 Mar, 2011

1 commit

7eaceacca block: remove per-queue plugging ... Browse Code »
86

Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().

Signed-off-by: Jens Axboe

Jens Axboe
2011-03-10 15:52:07 +0800

25 Jan, 2011

2 commits

ae1b15396 block: reimplement FLUSH/FUA to support merge ... Browse Code »

The current FLUSH/FUA support has evolved from the implementation
which had to perform queue draining. As such, sequencing is done
queue-wide one flush request after another. However, with the
draining requirement gone, there's no reason to keep the queue-wide
sequential approach.

This patch reimplements FLUSH/FUA support such that each FLUSH/FUA
request is sequenced individually. The actual FLUSH execution is
double buffered and whenever a request wants to execute one for either
PRE or POSTFLUSH, it queues on the pending queue. Once certain
conditions are met, a flush request is issued and on its completion
all pending requests proceed to the next sequence.

This allows arbitrary merging of different type of flushes. How they
are merged can be primarily controlled and tuned by adjusting the
above said 'conditions' used to determine when to issue the next
flush.

This is inspired by Darrick's patches to merge multiple zero-data
flushes which helps workloads with highly concurrent fsync requests.

* As flush requests are never put on the IO scheduler, request fields
used for flush share space with rq->rb_node. rq->completion_data is
moved out of the union. This increases the request size by one
pointer.

As rq->elevator_private* are used only by the iosched too, it is
possible to reduce the request size further. However, to do that,
we need to modify request allocation path such that iosched data is
not allocated for flush requests.

* FLUSH/FUA processing happens on insertion now instead of dispatch.

- Comments updated as per Vivek and Mike.

Signed-off-by: Tejun Heo
Cc: "Darrick J. Wong"
Cc: Shaohua Li
Cc: Christoph Hellwig
Cc: Vivek Goyal
Cc: Mike Snitzer
Signed-off-by: Jens Axboe

Tejun Heo
2011-01-25 19:43:54 +0800
414b4ff5e block: add REQ_FLUSH_SEQ ... Browse Code »

rq == &q->flush_rq was used to determine whether a rq is part of a
flush sequence, which worked because all requests in a flush sequence
were sequenced using the single dedicated request. This is about to
change, so introduce REQ_FLUSH_SEQ flag to distinguish flush sequence
requests.

This patch doesn't cause any behavior change.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2011-01-25 19:43:49 +0800

25 Oct, 2010

1 commit

f253b86b4 Revert "block: fix accounting bug on cross partition merges" ... Browse Code »

This reverts commit 7681bfeeccff5efa9eb29bf09249a3c400b15327.

Conflicts:

include/linux/genhd.h

It has numerous issues with the cleanup path and non-elevator
devices. Revert it for now so we can come up with a clean
version without rushing things.

Signed-off-by: Jens Axboe

Jens Axboe
2010-10-25 04:06:02 +0800

23 Oct, 2010

1 commit

a2887097f Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block ... Browse Code »

* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...

Linus Torvalds
2010-10-23 08:07:18 +0800