07 Jun, 2008

3 commits

  • If a block is computed (rather than read) then a check/repair operation
    may be led to believe that the data on disk is correct, when in fact it
    isn't. So only compute blocks for failed devices.

    This issue has been around since at least 2.6.12, but has become harder to
    hit in recent kernels since most reads bypass the cache.

    echo repair > /sys/block/mdN/md/sync_action will set the parity blocks to the
    correct state.
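
    For illustration, the whole sequence on an affected array might look like
    this (the device name /dev/md0 is just an example):

    $ echo check > /sys/block/md0/md/sync_action
    $ cat /sys/block/md0/md/mismatch_cnt          # non-zero after the check finishes if parity is stale
    $ echo repair > /sys/block/md0/md/sync_action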

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • If an array was created with --assume-clean we will oops when trying to
    set ->resync_max.

    Fix this by initializing ->recovery_wait in mddev_find.
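
    One way to reach the affected path (commands and the 'sync_max' sysfs
    attribute, which writes ->resync_max, are illustrative assumptions, not
    taken from the patch):

    $ mdadm --create /dev/md0 --level=5 --raid-devices=3 --assume-clean /dev/sd[bcd]
    $ echo 1000000 > /sys/block/md0/md/sync_max   # could oops here on unpatched kernels (uninitialised ->recovery_wait)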

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • During the initial array synchronization process there is a window,
    between when a prexor operation is scheduled to a specific stripe and when
    it completes, in which a sync_request can be scheduled to the same stripe.
    When this happens the prexor completes and the stripe is unconditionally
    marked "in_sync", effectively canceling the sync_request for the stripe.
    Prior to 2.6.23 this was not a problem because the prexor operation was
    done under sh->lock. The effect in older kernels was that the prexor would
    still erroneously mark the stripe "in_sync", but sync_request would be
    held off and re-mark the stripe as "!in_sync".

    Change the write completion logic to not mark the stripe "in_sync" if a
    prexor was performed. The effect of the change is to sometimes not set
    STRIPE_INSYNC. The worst this can do is cause the resync to stall waiting
    for STRIPE_INSYNC to be set. If this were happening, then STRIPE_SYNCING
    would be set and handle_issuing_new_read_requests would cause all
    available blocks to eventually be read, at which point prexor would never
    be used on that stripe any more and STRIPE_INSYNC would eventually be set.

    echo repair > /sys/block/mdN/md/sync_action will correct arrays that may
    have lost this race.

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

25 May, 2008

8 commits

  • When we get any IO error during a recovery (rebuilding a spare), we abort
    the recovery and restart it.

    For RAID6 (and multi-drive RAID1) it may not be best to restart at the
    beginning: when multiple failures can be tolerated, the recovery may be
    able to continue and re-doing all that has already been done doesn't make
    sense.

    We already have the infrastructure to record where a recovery is up to
    and restart from there, but it is not being used properly.
    This is because:
    - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not to be checkpointed.
    - We remove spares and then re-add them, which loses important state
    information.

    The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
    needed. If there is an error, the relevant drive will be marked as
    Faulty, and that is enough to ensure correct handling of the error. So we
    first remove MD_RECOVERY_ERR, changing some of the uses of it to
    MD_RECOVERY_INTR.

    Then we cause the attempt to remove a non-faulty device from an array to
    fail (unless recovery is impossible as the array is too degraded). Then
    when remove_and_add_spares attempts to remove the devices on which
    recovery can continue, it will fail, they will remain in place, and
    recovery will continue on them as desired.

    Issue: If we are halfway through rebuilding a spare and another drive
    fails, and a new spare is immediately available, do we want to:
    1/ complete the current rebuild, then go back and rebuild the new spare or
    2/ restart the rebuild from the start and rebuild both devices in
    parallel.

    Both options can be argued for. The code currently takes option 2 as
    a/ this requires least code change
    b/ this results in a minimally-degraded array in minimal time.

    Cc: "Eivind Sarto"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • In some configurations, a raid6 resync can be limited by CPU speed
    (Calculating P and Q and moving data) rather than by device speed. In
    these cases there is nothing to be gained by serialising resync of arrays
    that share a device, and doing the resync in parallel can provide benefit.
    So add a sysfs tunable to flag an array as being allowed to resync in
    parallel with other arrays that use (a different part of) the same device.
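
    In the final interface this is a per-array sysfs attribute; assuming the
    name 'sync_force_parallel' (not spelled out in this log), enabling it on
    two arrays that share disks looks like:

    $ echo 1 > /sys/block/md0/md/sync_force_parallel
    $ echo 1 > /sys/block/md1/md/sync_force_parallel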

    Signed-off-by: Bernd Schubert
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schubert
     
  • This additional notification to 'array_state' is needed to allow the
    monitor application to learn about stop events via sysfs. The
    sysfs_notify("sync_action") call that comes at the end of do_md_stop()
    (via md_new_event) is insufficient since the 'sync_action' attribute has
    been removed by this point.

    (Seems like a sysfs-notify-on-removal patch is a better fix. Currently
    removal updates the event count but does not wake up waiters)

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • When an array enters write pending, 'array_state' changes, so we must be
    sure to sysfs_notify.

    Also, when waiting for user-space to acknowledge 'write-pending' by
    marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
    be cleared as that might not happen. So explicitly test for the bits that
    we are really interested in.
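
    Roughly, an external metadata handler sees the transition and acknowledges
    it through 'array_state' (paths illustrative):

    $ cat /sys/block/md0/md/array_state
    write-pending
    $ echo active > /sys/block/md0/md/array_state   # after marking the metadata dirty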

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When performing a "recovery" or "check" pass on a RAID1 array, we read
    from each device and possibly, if there is a difference or a read error,
    write back to some devices.

    We use the same 'bio' for both read and write, resetting various fields
    between the two operations.

    We forgot to reset bv_offset and bv_len however. These are often left
    unchanged, but in the case where there is an IO error one or two sectors
    into a page, they are changed.

    This results in correctable errors not being corrected properly. It does
    not result in any data corruption.

    Cc: "Fairbanks, David"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Last night we had scsi problems and a hardware raid unit was offlined
    during heavy i/o. While this happened we got, for about 3 minutes, a huge
    number of messages like these:

    Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).

    I guess the high error rate is responsible for not scheduling other events
    - during this time the system was not pingable and in the end other
    devices also ran into scsi command timeouts, causing problems on these
    unrelated devices as well.

    Signed-off-by: Bernd Schubert
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schubert
     
  • Kill the trivial and rather pointless file_path wrapper around d_path.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • It is possible to add a write-intent bitmap to an active array, or remove
    the bitmap that is there.

    When we do either of these we 'quiesce' the array, which causes
    make_request to block in "wait_barrier()".

    However we are sampling the value of "mddev->bitmap" before the
    wait_barrier call, and using it afterwards. This can result in using a
    bitmap structure that has been freed.
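
    The race is exercised by adding or removing a bitmap while the array is
    busy, for example (mdadm invocations illustrative):

    $ mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap to the running array
    $ mdadm --grow /dev/md0 --bitmap=none       # and remove it again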

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 May, 2008

1 commit

  • As setting and clearing queue flags now requires that we hold a spinlock
    on the queue, and as blk_queue_stack_limits is called without that lock,
    get the lock inside blk_queue_stack_limits.

    For blk_queue_stack_limits to be able to find the right lock, each md
    personality needs to set q->queue_lock to point to the appropriate lock.
    Those personalities which didn't previously use a spin_lock use
    q->__queue_lock. So always initialise that lock when the queue is allocated.

    With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
    longer cause warnings as it will be clear that the proper lock is held.

    Thanks to Dan Williams for review and fixing the silly bugs.

    Signed-off-by: NeilBrown
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Alistair John Strachan
    Cc: Nick Piggin
    Cc: "Rafael J. Wysocki"
    Cc: Jacek Luczak
    Cc: Prakash Punnoor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

13 May, 2008

1 commit

  • commit bd2ab67030e9116f1e4aae1289220255412b37fd "md: close a livelock window
    in handle_parity_checks5" introduced a bug in handling 'repair' operations.
    After a repair operation completes we clear the state bits tracking this
    operation. However, they are cleared too early and this results in the code
    deciding to re-run the parity check operation. Since we have done the repair
    in memory the second check does not find a mismatch and thus does not do a
    writeback.

    Test results:
    $ echo repair > /sys/block/md0/md/sync_action
    $ cat /sys/block/md0/md/mismatch_cnt
    51072
    $ echo repair > /sys/block/md0/md/sync_action
    $ cat /sys/block/md0/md/mismatch_cnt
    0

    (also fix incorrect indentation)

    Cc:
    Tested-by: George Spelvin
    Acked-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

09 May, 2008

1 commit

  • drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
    drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
    sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

30 Apr, 2008

9 commits

  • Allows a userspace metadata handler to take action upon detecting a device
    failure.

    Based on an original patch by Neil Brown.

    Changes:
    -added blocked_wait waitqueue to rdev
    -don't qualify Blocked with Faulty; always let userspace block writes
    -added md_wait_for_blocked_rdev to wait for the block device to be clear; if
    userspace misses the notification another one is sent every 5 seconds
    -set MD_RECOVERY_NEEDED after clearing "blocked"
    -kill DoBlock flag, just test mddev->external
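
    A sketch of the expected userspace side, using the per-device sysfs state
    file (attribute values assumed from the md sysfs interface, paths
    illustrative):

    $ cat /sys/block/md0/md/dev-sdb/state               # shows 'blocked' alongside 'faulty'
    $ echo -blocked > /sys/block/md0/md/dev-sdb/state   # clear it once the failure is recorded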

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Found when trying to reassemble an active externally managed array. Without
    this check we hit the more noisy "sysfs duplicate" warning in the later call
    to kobject_add.

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • When setting an array to 'readonly' or to 'active' via sysfs, we must make the
    appropriate set_disk_ro call too.

    Also when switching to "read_auto" (which is like readonly, but blocks on the
    first write so that metadata can be marked 'dirty') we need to be more careful
    about what state we are changing from.
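
    The effect can be checked from userspace, e.g. (commands illustrative):

    $ echo readonly > /sys/block/md0/md/array_state
    $ blockdev --getro /dev/md0     # should now report 1
    $ echo active > /sys/block/md0/md/array_state
    $ blockdev --getro /dev/md0     # back to 0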

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • 'safemode' relates to marking an array as 'clean' if there has been no write
    traffic for a while (a couple of seconds), to reduce the chance of the array
    being found dirty on reboot.

    ->safemode is set to '1' when there have been no writes for a while, and it
    gets set to '0' when the superblock is updated with the 'clean' flag set.

    This requires a few fixes for 'external' metadata:
    - When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
    - when we write to an array that has 'safemode' set (there must have been
    some delay in updating the metadata), we need to clear safemode.
    - Don't try to update external metadata in md_check_recovery for safemode
    transitions - it won't work.

    Also, don't try to support "immediate safe mode" (safemode==2) for external
    metadata, it cannot really work (the safemode timeout can be set very low if
    this is really needed).
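
    For external metadata a short timeout can stand in for immediate safe
    mode, e.g. (values illustrative):

    $ cat /sys/block/md0/md/safe_mode_delay     # default is roughly 0.2 seconds
    $ echo 0.05 > /sys/block/md0/md/safe_mode_delay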

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • I keep finding problems where an mddev gets reused and some fields have a
    value from a previous usage that confuses the new usage. So clear all fields
    that could possibly need clearing when calling do_md_stop.

    Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • All the metadata update processing for external metadata is done in
    user-space or through the sysfs interfaces, so make "md_update_sb" a no-op
    in that case.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • rdev->mddev is no longer valid upon return from entry->store() when the
    'remove' command is given.
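
    The command in question is the per-device 'remove' write, e.g. (path
    illustrative):

    $ echo remove > /sys/block/md0/md/dev-sdb/state   # rdev->mddev must not be touched after this returns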

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • dm.c already provides mutual exclusion through ->map_lock.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

29 Apr, 2008

4 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Skip I/O merges when disabled
    block: add large command support
    block: replace sizeof(rq->cmd) with BLK_MAX_CDB
    ide: use blk_rq_init() to initialize the request
    block: use blk_rq_init() to initialize the request
    block: rename and export rq_init()
    block: no need to initialize rq->cmd with blk_get_request
    block: no need to initialize rq->cmd in prepare_flush_fn hook
    block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
    block/elevator.c:elv_rq_merge_ok() mustn't be inline
    block: make queue flags non-atomic
    block: add dma alignment and padding support to blk_rq_map_kern
    unexport blk_max_pfn
    ps3disk: Remove superfluous cast
    block: make rq_init() do a full memset()
    relay: fix splice problem

    Linus Torvalds
     
  • Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
    are set up before gluing the PDE to the main tree.

    Signed-off-by: Denis V. Lunev
    Cc: Greg Kroah-Hartman
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Peter Osterlund
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dmitry Torokhov
    Cc: Neil Brown
    Cc: Mauro Carvalho Chehab
    Cc: Bjorn Helgaas
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis V. Lunev
     
  • blk_get_request initializes rq->cmd (rq_init does) so the users don't
    need to do that.

    The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd,
    as a preparation for large command support, which changes rq->cmd from
    the static array to a pointer. sizeof(rq->cmd) will not make sense and
    &rq->cmd won't work.

    Signed-off-by: FUJITA Tomonori
    Cc: James Bottomley
    Cc: Alasdair G Kergon
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • We can save some atomic ops in the IO path, if we clearly define
    the rules of how to modify the queue flags.

    Signed-off-by: Jens Axboe

    Nick Piggin
     

28 Apr, 2008

6 commits

  • The functions time_before, time_before_eq, time_after, and time_after_eq
    are more robust for comparing jiffies against other values.

    A simplified version of the semantic patch making this change is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @ change_compare_np @
    expression E;
    @@

    (
    - jiffies <= E
    + time_before_eq(jiffies,E)
    |
    - jiffies >= E
    + time_after_eq(jiffies,E)
    |
    - jiffies < E
    + time_before(jiffies,E)
    |
    - jiffies > E
    + time_after(jiffies,E)
    )

    @ include depends on change_compare_np @
    @@

    #include <linux/jiffies.h>

    @ no_include depends on !include && change_compare_np @
    @@

    #include <linux/...>
    + #include <linux/jiffies.h>
    //
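
    A semantic patch like this is applied with spatch; option names differ
    between Coccinelle releases, so the invocation below is only a sketch:

    $ spatch --sp-file jiffies.cocci --in-place drivers/md/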

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Julia Lawall
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • MD drivers use one printk() call to print 2 log messages and the second line
    may be prefixed by a TAB character. It may also output a trailing space
    before newline. klogd (I think) turns the TAB character into the 2 characters
    '^I' when logging to a file. This looks ugly.

    Instead of a leading TAB to indicate continuation, prefix both output lines
    with 'raid:' or similar. Also remove any trailing space in the vicinity of
    the affected code and consistently end the sentences with a period.

    Signed-off-by: Nick Andrew
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Andrew
     
  • strict_strtoul handles the open-coded sanity checks in
    raid5_store_stripe_cache_size and raid5_store_preread_threshold

    Acked-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Improve write performance by preventing the delayed_list from dumping all its
    stripes onto the handle_list in one shot. Delayed stripes are now further
    delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when:

    * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list'
    * 'handle_list' is empty and i/o is being done to satisfy full stripe-width
    write requests
    * 'bypass_count' is less than 'bypass_threshold'. By default the threshold
    is 1, i.e. every other stripe handled is a preread stripe provided the
    top two conditions are false.

    Benchmark data:
    System: 2x Xeon 5150, 4x SATA, mem=1GB
    Baseline: 2.6.24-rc7
    Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
    Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
    * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512)

    Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS)
    * patched: +13%
    * patched + preread_bypass_threshold = 0: +37%
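
    The threshold is a per-array sysfs attribute, so the fairness/throughput
    trade-off can be adjusted at runtime (path illustrative):

    $ cat /sys/block/md0/md/preread_bypass_threshold        # defaults to 1
    $ echo 0 > /sys/block/md0/md/preread_bypass_threshold   # favour throughput, as in the benchmark above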

    Changes since v1:
    * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and
    make it configurable. This defaults to fairness and modest performance
    gains out of the box.
    Changes since v2:
    * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not
    necessary, the important change was clearing STRIPE_DELAYED in
    add_stripe_bio and this has been moved out to make_request for the hang
    fix.
    * [neilb@suse.de]: simplify get_priority_stripe
    * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is
    sampled empty (+11%)
    * [dan.j.williams@intel.com]: decrement the bypass_count at the detection
    of stripes being naturally promoted off of hold_list +2%. Note, resetting
    bypass_count instead of decrementing on these events yields +4% but that is
    probably too aggressive.
    Changes since v3:
    * cosmetic fixups

    Tested-by: James W. Laferriere
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer
    drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer

    Add some braces to match the else-block as well.

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

25 Apr, 2008

7 commits

  • The small patch below:
    - Removes the unused md argument from both specific_minor() and next_free_minor()
    - Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev()

    This has been compile tested on x86.

    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Alasdair G Kergon

    Frederik Deweerdt
     
  • dm_create_error_table() was added in kernel 2.6.18 and never used...

    Signed-off-by: Adrian Bunk
    Signed-off-by: Alasdair G Kergon

    Adrian Bunk
     
  • void returning functions returned the return value of another void
    returning function...

    Spotted by sparse.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Alasdair G Kergon

    Adrian Bunk
     
  • Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.

    It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
    queue (i.e. block the requests from being dispatched to the physical device).

    The queue is unplugged when the caller calls the blk_unplug() function. Usually, the
    sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
    the queue and waits (to be possibly joined with other adjacent bios). Then, when
    the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
    the disk.

    This was happening:

    When doing O_SYNC writes, function fsync_buffers_list() submits a list of
    bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
    woken up.

    fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
    there are no bios on the device queue as they are still in the dm_raid1 queue.

    wait_on_buffer() starts waiting until the IO is finished.

    kmirrord is scheduled, kmirrord takes bios and submits them to the devices.

    The submitted bio plugs the harddisk queue but there is no one to unplug it.
    (The process that called wait_on_buffer() is already sleeping.)

    So there is a 3ms timeout, after which the queues on the harddisks are
    unplugged and requests are processed.

    This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
    dm-raid1 is 10 times slower than md raid1.

    Every time we submit something asynchronously via dm_io, we must unplug the
    queue to actually send the request to the device.

    This patch adds an unplug call to kmirrord - while processing requests, it keeps
    the queue plugged (so that adjacent bios can be merged); when it finishes
    processing all the bios, it unplugs the queue to submit the bios.

    It also fixes kcopyd which has the same potential problem. All kcopyd requests
    are submitted with BIO_RW_SYNC.
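
    The regression shows up with small synchronous writes; a workload of that
    shape can be generated with, for example (target path illustrative):

    $ dd if=/dev/zero of=/mnt/mirror/testfile bs=8k count=1000 oflag=sync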

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Acked-by: Jens Axboe

    Mikulas Patocka
     
  • This patch replaces the schedule() in the main kmirrord thread with a timer.
    The schedule() could introduce an unwanted delay when work is ready to be
    processed.

    The code instead calls wake() when there's work to be done immediately, and
    delayed_wake() after a failure to give a short delay before retrying.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Publish the dm-io, dm-log and dm-kcopyd headers in include/linux.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Rename kcopyd.[ch] to dm-kcopyd.[ch].

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon