15 May, 2008
1 commit
-
As setting and clearing queue flags now requires that we hold a spinlock
on the queue, and as blk_queue_stack_limits is called without that lock,
get the lock inside blk_queue_stack_limits.For blk_queue_stack_limits to be able to find the right lock, each md
personality needs to set q->queue_lock to point to the appropriate lock.
Those personalities which didn't previously use a spin_lock, us
q->__queue_lock. So always initialise that lock when allocated.With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
longer cause warnings as it will be clear that the proper lock is held.Thanks to Dan Williams for review and fixing the silly bugs.
Signed-off-by: NeilBrown
Cc: Dan Williams
Cc: Jens Axboe
Cc: Alistair John Strachan
Cc: Nick Piggin
Cc: "Rafael J. Wysocki"
Cc: Jacek Luczak
Cc: Prakash Punnoor
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Apr, 2008
1 commit
-
Allows a userspace metadata handler to take action upon detecting a device
failure.Based on an original patch by Neil Brown.
Changes:
-added blocked_wait waitqueue to rdev
-don't qualify Blocked with Faulty always let userspace block writes
-added md_wait_for_blocked_rdev to wait for the block device to be clear, if
userspace misses the notification another one is sent every 5 seconds
-set MD_RECOVERY_NEEDED after clearing "blocked"
-kill DoBlock flag, just test mddev->externalSigned-off-by: Dan Williams
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Apr, 2008
1 commit
-
MD drivers use one printk() call to print 2 log messages and the second line
may be prefixed by a TAB character. It may also output a trailing space
before newline. klogd (I think) turns the TAB character into the 2 characters
'^I' when logging to a file. This looks ugly.Instead of a leading TAB to indicate continuation, prefix both output lines
with 'raid:' or similar. Also remove any trailing space in the vicinity of
the affected code and consistently end the sentences with a period.Signed-off-by: Nick Andrew
Cc: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Mar, 2008
2 commits
-
Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
another possible deadlock in raid1/raid10 error handing.If a read request returns an error while a resync is happening and a resync
request is pending, the attempt to fix the error will block until the resync
progresses, and the resync will block until the read request completes. Thus
a deadlock.This patch fixes the problem.
Cc: "K.Tanaka"
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.Thanks to "K.Tanaka" for finding and reporting this
problem.Cc: "K.Tanaka"
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Feb, 2008
3 commits
-
As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently an md array with a write-intent bitmap does not updated that bitmap
to reflect successful partial resync. Rather the entire bitmap is updated
when the resync completes.This is because there is no guarentee that resync requests will complete in
order, and tracking each request individually is unnecessarily burdensome.However there is value in regularly updating the bitmap, so add code to
periodically pause while all pending sync requests complete, then update the
bitmap. Doing this only every few seconds (the same as the bitmap update
time) does not notciably affect resync performance.[snitzer@gmail.com: export bitmap_cond_end_sync]
Signed-off-by: Neil Brown
Cc: "Mike Snitzer"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Nov, 2007
1 commit
-
Added blk_unplug interface, allowing all invocations of unplugs to result
in a generated blktrace UNPLUG.Signed-off-by: Alan D. Brunelle
Signed-off-by: Jens Axboe
20 Oct, 2007
1 commit
-
* Convert files to UTF-8.
* Also correct some people's names
(one example is Eißfeldt, which was found in a source file.
Given that the author used an ß at all in a source file
indicates that the real name has in fact a 'ß' and not an 'ss',
which is commonly used as a substitute for 'ß' when limited to
7bit.)* Correct town names (Goettingen -> Göttingen)
* Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)
Signed-off-by: Jan Engelhardt
Signed-off-by: Adrian Bunk
17 Oct, 2007
1 commit
-
Whenever a read error is found, we should attempt to overwrite with correct
data to 'fix' it.However when do a 'check' pass (which compares data blocks that are
successfully read, but doesn't normally overwrite) we don't do that. We
should.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
16 Oct, 2007
1 commit
-
Then we can get rid of ->issue_flush_fn() and all the driver private
implementations of that.Signed-off-by: Jens Axboe
10 Oct, 2007
1 commit
-
As bi_end_io is only called once when the reqeust is complete,
the 'size' argument is now redundant. Remove it.Now there is no need for bio_endio to subtract the size completed
from bi_size. So don't do that either.While we are at it, change bi_end_io to return void.
Signed-off-by: Neil Brown
Signed-off-by: Jens Axboe
23 Aug, 2007
2 commits
-
When a raid1 array is reshaped (number of drives changed), the list of devices
is compacted, so that slots for missing devices are filled with working
devices from later slots. This requires the "rd%d" symlinks in sysfs to be
updated.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit 1757128438d41670ded8bc3bc735325cc07dc8f9 was slightly bad. If an array
has a write-intent bitmap, and you remove a drive, then readd it, only the
changed parts should be resynced. However after the above commit, this only
works if the array has not been shut down and restarted.This is because it sets 'fullsync' at little more often than it should. This
patch is more careful.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jul, 2007
1 commit
-
Some of the code has been gradually transitioned to using the proper
struct request_queue, but there's lots left. So do a full sweet of
the kernel and get rid of this typedef and replace its uses with
the proper type.Signed-off-by: Jens Axboe
18 Jul, 2007
1 commit
-
bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
to print a message if it returns non-zero, but that message is already printed
by bitmap_file_kick.write_page returns an error which is not consistently checked. It always
causes BITMAP_WRITE_ERROR to be set on an error, and that can more
conveniently be checked.When the return of write_page is checked, an error causes bitmap_file_kick to
be called - so move that call into write_page - and protect against recursive
calls into bitmap_file_kick.bitmap_update_sb returns an error that is never checked.
So make these 'void' and be consistent about checking the bit.
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jun, 2007
1 commit
-
If raid1/repair (which reads all block and fixes any differences it finds)
hits a read error, it doesn't reset the bio for writing before writing
correct data back, so the read error isn't fixed, and the device probably
gets a zero-length write which it might complain about.Signed-off-by: Neil Brown
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 May, 2007
1 commit
-
When a raid1 has only one working drive, we want read error to propagate up
to the filesystem as there is no point failing the last drive in an array.Currently the code perform this check is racy. If a write and a read a
both submitted to a device on a 2-drive raid1, and the write fails followed
by the read failing, the read will see that there is only one working drive
and will pass the failure up, even though the one working drive is actually
the *other* one.So, tighten up the locking.
Signed-off-by: Neil Brown
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 May, 2007
2 commits
-
This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.
Quoth Neil Brown:
"It causes an oops when auto-detecting raid arrays, and it doesn't
seem easy to fix.The array may not be 'open' when do_md_run is called, so
bdev->bd_disk might be NULL, so bd_set_size can oops.This whole approach of opening an md device before it has been
assembled just seems to get more and more painful. I think I'm going
to have to come up with something clever to provide both backward
comparability with usage expectation, and sane integration into the
rest of the kernel."Signed-off-by: Linus Torvalds
-
md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.However that doesn't happen until the array is opened, which is later
than some people would like.So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jan, 2007
2 commits
-
If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
it is possible for a deadlock to eventuate.This happens if the array was marked 'clean', and the memalloc triggers a
write-out to the md device.For the writeout to succeed, the array must be marked 'dirty', and that
requires getting the mddev_lock.So, before attempting a GFP_KERNEL allocation while holding the lock, make
sure the array is marked 'dirty' (unless it is currently read-only).Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When 'repair' finds a block that is different one the various parts of the
mirror. it is meant to write a chosen good version to the others. However it
currently writes out the original data to each. The memcpy to make all the
data the same is missing.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jan, 2007
1 commit
-
md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
introducing additional latency.Fixing this in raid1 and raid10 seems to be straightforward enough.
For our particular usage case in DRBD, passing this flag improved some
initialization time from ~5 minutes to ~5 seconds.Acked-by: NeilBrown
Signed-off-by: Lars Ellenberg
Acked-by: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Dec, 2006
1 commit
-
Thanks Jens for alerting me to this.
Cc: Jens Axboe
Cc:
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Dec, 2006
1 commit
-
Fix few bugs that meant that:
- superblocks weren't alway written at exactly the right time (this
could show up if the array was not written to - writting to the array
causes lots of superblock updates and so hides these errors).- restarting device recovery after a clean shutdown (version-1 metadata
only) didn't work as intended (or at all).1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
The body of this if takes one of two branches depending on whether
MD_RECOVERY_SYNC is set, so testing it in the clause of the if
is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the neeed to restart a recovery in the middle (version-1
metadata only) make sure a full recovery (not just as guided by
bitmaps) does get done.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Oct, 2006
1 commit
-
drivers/md/raid1.c:1479: warning: long long unsigned int format, long unsigned int arg (arg 4)
drivers/md/raid10.c:1475: warning: long long unsigned int format, long unsigned int arg (arg 4)Signed-off-by: Randy Dunlap
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
03 Oct, 2006
5 commits
-
raid1, raid10 and multipath don't report their 'congested' status through
bdi_*_congested, but should.This patch adds the appropriate functions which just check the 'congested'
status of all active members (with appropriate locking).raid1 read_balance should be modified to prefer devices where
bdi_read_congested returns false. Then we could use the '&' branch rather
than the '|' branch. However that should would need some benchmarking first
to make sure it is actually a good idea.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The error handling routines don't use proper locking, and so two concurrent
errors could trigger a problem.So:
- use test-and-set and test-and-clear to synchonise
the In_sync bits with the ->degraded count
- use the spinlock to protect updates to the
degraded count (could use an atomic_t but that
would be a bigger change in code, and isn't
really justified)
- remove un-necessary locking in raid5Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is equivalent to conf->raid_disks - conf->mddev->degraded.
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
raid1d has toooo many nested block, so take the fix_read_error functionality
out into a separate function.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Instead of magic numbers (0,1,2,3) in sb_dirty, we have
some flags instead:
MD_CHANGE_DEVS
Some device state has changed requiring superblock update
on all devices.
MD_CHANGE_CLEAN
The array has transitions from 'clean' to 'dirty' or back,
requiring a superblock update on active devices, but possibly
not on spares
MD_CHANGE_PENDING
A superblock update is underway.We wait for an update to complete by waiting for all flags to be clear. A
flag can be set at any time, even during an update, without risk that the
change will be lost.Stop exporting md_update_sb - isn't needed.
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Sep, 2006
1 commit
-
We need to be careful when referencing mirrors[i].rdev. It can disappear
under us at various times.So:
fix a couple of problem places.
comment a couple of non-problem places
move an 'atomic_add' which deferences rdev down a little
way to some where where it is sure to not be NULL.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Aug, 2006
1 commit
-
A recent patch broke the ability to do a user-request check of a raid1.
This patch fixes the breakage and also moves a comment that was dislocated
by the same patch.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jul, 2006
2 commits
-
This is generally useful, but particularly helps see if it is the same sector
that always needs correcting, or different ones.[akpm@osdl.org: fix printk warnings]
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Though it rarely matters, we should be using 's' rather than r1_bio->sector
here.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jun, 2006
3 commits
-
When an array has a bitmap, a device can be removed and re-added and only
blocks changes since the removal (as recorded in the bitmap) will be resynced.It should be possible to do a similar thing to arrays without bitmaps. i.e.
if a device is removed and re-added and *no* changes have been made in the
interim, then the add should not require a resync.This patch allows that option. This means that when assembling an array one
device at a time (e.g. during device discovery) the array can be enabled
read-only as soon as enough devices are available, but extra devices can still
be added without causing a resync.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
For a while we have had checkpointing of resync. The version-1 superblock
allows recovery to be checkpointed as well, and this patch implements that.Due to early carelessness we need to add a feature flag to signal that the
recovery_offset field is in use, otherwise older kernels would assume that a
partially recovered array is in fact fully recovered.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A recent change made this goto unnecessary, so reformat the code to make it
clearer what is happening.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 May, 2006
1 commit
-
When retrying a failed BIO_RW_BARRIER request, we need to keep the reference
in ->nr_pending over the whole retry. Currently, we only hold the reference
if the failed request is the *last* one to finish - which is silly, because it
would normally be the first to finish.So move the rdev_dec_pending call up into the didn't-fail branch. As the rdev
isn't used in the later code, calling rdev_dec_pending earlier doesn't hurt.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds