09 May, 2007
1 commit
-
Remove do_sync_file_range() and convert callers to just use
do_sync_mapping_range().Signed-off-by: Mark Fasheh
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 May, 2007
1 commit
-
Remove the destroy_dirty_buffers argument from invalidate_bdev(), it hasn't
been used in 6 years (so akpm says).find * -name \*.[ch] | xargs grep -l invalidate_bdev |
while read file; do
quilt add $file;
sed -ie 's/invalidate_bdev(\([^,]*\),[^)]*)/invalidate_bdev(\1)/g' $file;
doneSigned-off-by: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Apr, 2007
1 commit
-
Currently we scale the mempool sizes depending on memory installed
in the machine, except for the bio pool itself which sits at a fixed
256 entry pre-allocation.There's really no point in "optimizing" this OOM path, we just need
enough preallocated to make progress. A single unit is enough, lets
scale it down to 2 just to be on the safe side.This patch saves ~150kb of pinned kernel memory on a 32-bit box.
Signed-off-by: Jens Axboe
13 Apr, 2007
1 commit
-
If 'num_pages' were ever 1 more than a multiple of 8 (32bit platforms)
or of 16 (64 bit platforms). filemap_attr would be allocated one
'unsigned long' shorter than required. We need a round-up in there.Signed-off-by: Neil Brown
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Apr, 2007
1 commit
-
A device can be removed from an md array via e.g.
echo remove > /sys/block/md3/md/dev-sde/stateThis will try to remove the 'dev-sde' subtree which will deadlock
since
commit e7b0d26a86943370c04d6833c6edba2a72a6e240With this patch we run the kobject_del via schedule_work so as to
avoid the deadlock.Cc: Alan Stern
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Mar, 2007
3 commits
-
... still not sure why we need this ....
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If this mddev and queue got reused for another array that doesn't register a
congested_fn, this function would get called incorretly.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
All that is missing the the function pointers in raid4_pers.
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Mar, 2007
1 commit
-
When iterating through an array, one must be careful to test one's index
variable rather than another similarly-named variable.The loop will read off the end of conf->disks[] in the following
(pathological) case:% dd bs=1 seek=840716287 if=/dev/zero of=d1 count=1
% for i in 2 3 4; do dd if=/dev/zero of=d$i bs=1k count=$(($i+150)); done
% ./vmlinux ubd0=root ubd1=d1 ubd2=d2 ubd3=d3 ubd4=d4
# mdadm -C /dev/md0 --level=linear --raid-devices=4 /dev/ubd[1234]adding some printks, I saw this:
[42949374.960000] hash_spacing = 821120
[42949374.960000] cnt = 4
[42949374.960000] min_spacing = 801
[42949374.960000] j=0 size=820928 sz=820928
[42949374.960000] i=0 sz=820928 hash_spacing=820928
[42949374.960000] j=1 size=64 sz=64
[42949374.960000] j=2 size=64 sz=128
[42949374.960000] j=3 size=64 sz=192
[42949374.960000] j=4 size=1515870810 sz=1515871002Cc: Gautham R Shenoy
Acked-by: Neil Brown
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Mar, 2007
1 commit
-
Recent patch for raid6 reshape had a change missing that showed up in
subsequent review.Many places in the raid5 code used "conf->raid_disks-1" to mean "number of
data disks". With raid6 that had to be changed to "conf->raid_disk -
conf->max_degraded" or similar. One place was missed.This bug means that if a raid6 reshape were aborted in the middle the
recorded position would be wrong. On restart it would either fail (as the
position wasn't on an appropriate boundary) or would leave a section of the
array unreshaped, causing data corruption.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
02 Mar, 2007
6 commits
-
i.e. one or more drives can be added and the array will re-stripe
while on-line.Most of the interesting work was already done for raid5. This just extends it
to raid6.mdadm newer than 2.6 is needed for complete safety, however any version of
mdadm which support raid5 reshape will do a good enough job in almost all
cases (an 'echo repair > /sys/block/mdX/md/sync_action' is recommended after a
reshape that was aborted and had to be restarted with an such a version of
mdadm).Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
An error always aborts any resync/recovery/reshape on the understanding that
it will immediately be restarted if that still makes sense. However a reshape
currently doesn't get restarted. With this patch it does.To avoid restarting when it is not possible to do work, we call into the
personality to check that a reshape is ok, and strengthen raid5_check_reshape
to fail if there are too many failed devices.We also break some code out into a separate function: remove_and_add_spares as
the indent level for that code was getting crazy.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The mddev and queue might be used for another array which does not set these,
so they need to be cleared.Signed-off-by: NeilBrown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
md tries to warn the user if they e.g. create a raid1 using two partitions of
the same device, as this does not provide true redundancy.However it also warns if a raid0 is created like this, and there is nothing
wrong with that.At the place where the warning is currently printer, we don't necessarily know
what level the array will be, so move the warning from the point where the
device is added to the point where the array is started.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
- Use kernel_fpu_begin() and kernel_fpu_end()
- Use boot_cpu_has() for feature testing even in userspaceSigned-off-by: H. Peter Anvin
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are two errors that can lead to recovery problems with raid10
when used in 'far' more (not the default).Due to a '>' instead of '>=' the wrong block is located which would result in
garbage being written to some random location, quite possible outside the
range of the device, causing the newly reconstructed device to fail.The device size calculation had some rounding errors (it didn't round when it
should) and so recovery would go a few blocks too far which would again cause
a write to a random block address and probably a device error.The code for working with device sizes was fairly confused and spread out, so
this has been tided up a bit.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Feb, 2007
2 commits
-
The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.Signed-off-by: Eric W. Biederman
Acked-by: Ralf Baechle
Acked-by: Martin Schwidefsky
Cc: Russell King
Cc: David Howells
Cc: "Luck, Tony"
Cc: Ralf Baechle
Cc: Paul Mackerras
Cc: Martin Schwidefsky
Cc: Andi Kleen
Cc: Jens Axboe
Cc: Corey Minyard
Cc: Neil Brown
Cc: "John W. Linville"
Cc: James Bottomley
Cc: Jan Kara
Cc: Trond Myklebust
Cc: Mark Fasheh
Cc: David Chinner
Cc: "David S. Miller"
Cc: Patrick McHardy
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The sysctls used by the md driver are have unique binary numbers so remove the
insert_at_head flag as it serves no useful purpose.Signed-off-by: Eric W. Biederman
Cc: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Feb, 2007
1 commit
-
Many struct file_operations in the kernel can be "const". Marking them const
moves these to the .rodata section, which avoids false sharing with potential
dirty data. In addition it'll catch accidental writes at compile time to
these shared resources.[akpm@sdl.org: dvb fix]
Signed-off-by: Arjan van de Ven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Feb, 2007
1 commit
-
Convert all calls to invalidate_inode_pages() into open-coded calls to
invalidate_mapping_pages().Leave the invalidate_inode_pages() wrapper in place for now, marked as
deprecated.Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Feb, 2007
2 commits
-
md/bitmap tracks how many active write requests are pending on blocks
associated with each bit in the bitmap, so that it knows when it can clear
the bit (when count hits zero).The counter has 14 bits of space, so if there are ever more than 16383, we
cannot cope.Currently the code just calles BUG_ON as "all" drivers have request queue
limits much smaller than this.However is seems that some don't. Apparently some multipath configurations
can allow more than 16383 concurrent write requests.So, in this unlikely situation, instead of calling BUG_ON we now wait
for the count to drop down a bit. This requires a new wait_queue_head,
some waiting code, and a wakeup call.Tested by limiting the counter to 20 instead of 16383 (writes go a lot slower
in that case...).Signed-off-by: Neil Brown
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
It is possible for raid5 to be sent a bio that is too big for an underlying
device. So if it is a READ that we pass stright down to a device, it will
fail and confuse RAID5.So in 'chunk_aligned_read' we check that the bio fits within the parameters
for the target device and if it doesn't fit, fall back on reading through
the stripe cache and making lots of one-page requests.Note that this is the earliest time we can check against the device because
earlier we don't have a lock on the device, so it could change underneath
us.Also, the code for handling a retry through the cache when a read fails has
not been tested and was badly broken. This patch fixes that code.Signed-off-by: Neil Brown
Cc: "Kai"
Cc:
Cc:
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jan, 2007
6 commits
-
raid5_mergeable_bvec tries to ensure that raid5 never sees a read request
that does not fit within just one chunk. However as we must always accept
a single-page read, that is not always possible.So when "in_chunk_boundary" fails, it might be unusual, but it is not a
problem and printing a message every time is a bad idea.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
it is possible for a deadlock to eventuate.This happens if the array was marked 'clean', and the memalloc triggers a
write-out to the md device.For the writeout to succeed, the array must be marked 'dirty', and that
requires getting the mddev_lock.So, before attempting a GFP_KERNEL allocation while holding the lock, make
sure the array is marked 'dirty' (unless it is currently read-only).Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Allow noflush suspend/resume of device-mapper device only for the case
where the device size is unchanged.Otherwise, dm-multipath devices can stall when resumed if noflush was used
when suspending them, all paths have failed and queue_if_no_path is set.Explanation:
1. Something is doing fsync() on the block dev,
holding inode->i_sem
2. The fsync write is blocked by all-paths-down and queue_if_no_path
3. Someone requests to suspend the dm device with noflush.
Pending writes are left in queue.
4. In the middle of dm_resume(), __bind() tries to get
inode->i_sem to do __set_size() and waits forever.'noflush suspend' is a new device-mapper feature introduced in
early 2.6.20. So I hope the fix being included before 2.6.20 is
released.Example of reproducer:
1. Create a multipath device by dmsetup
2. Fail all paths during mkfs
3. Do dmsetup suspend --noflush and load new map with healthy paths
4. Do dmsetup resumeSigned-off-by: Jun'ichi Nomura
Acked-by: Alasdair G Kergon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
In most cases we check the size of the bitmap file before reading data from
it. However when reading the superblock, we always read the first PAGE_SIZE
bytes, which might not always be appropriate. So limit that read to the size
of the file if appropriate.Also, we get the count of available bytes wrong in one place, so that too can
read past the end of the file.Cc: "yang yin"
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that we sometimes step the array events count backwards (when
transitioning dirty->clean where nothing else interesting has happened - so
that we don't need to write to spares all the time), it is possible for the
event count to return to zero, which is potentially confusing and triggers and
MD_BUG.We could possibly remove the MD_BUG, but is just as easy, and probably safer,
to make sure we never return to zero.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When 'repair' finds a block that is different one the various parts of the
mirror. it is meant to write a chosen good version to the others. However it
currently writes out the original data to each. The memcpy to make all the
data the same is missing.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jan, 2007
1 commit
-
md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
introducing additional latency.Fixing this in raid1 and raid10 seems to be straightforward enough.
For our particular usage case in DRBD, passing this flag improved some
initialization time from ~5 minutes to ~5 seconds.Acked-by: NeilBrown
Signed-off-by: Lars Ellenberg
Acked-by: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Dec, 2006
1 commit
-
While developing more functionality in mdadm I found some bugs in md...
- When we remove a device from an inactive array (write 'remove' to
the 'state' sysfs file - see 'state_store') would should not
update the superblock information - as we may not have
read and processed it all properly yet.- initialise all raid_disk entries to '-1' else the 'slot sysfs file
will claim '0' for all devices in an array before the array is
started.- all '\n' not to be present at the end of words written to
sysfs files- when we use SET_ARRAY_INFO to set the md metadata version,
set the flag to say that there is persistant metadata.- allow GET_BITMAP_FILE to be called on an array that hasn't
been started yet.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Dec, 2006
1 commit
-
Thanks Jens for alerting me to this.
Cc: Jens Axboe
Cc:
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Dec, 2006
9 commits
-
As CBC is the default chaining method for cryptoloop, we should select
it from cryptoloop to ease the transition. Spotted by Rene Herman.Signed-off-by: Herbert Xu
Signed-off-by: Linus Torvalds -
Fix few bugs that meant that:
- superblocks weren't alway written at exactly the right time (this
could show up if the array was not written to - writting to the array
causes lots of superblock updates and so hides these errors).- restarting device recovery after a clean shutdown (version-1 metadata
only) didn't work as intended (or at all).1/ Ensure superblock is updated when a new device is added.
2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
The body of this if takes one of two branches depending on whether
MD_RECOVERY_SYNC is set, so testing it in the clause of the if
is wrong.
3/ Flag superblock for updating after a resync/recovery finishes.
4/ If we find the neeed to restart a recovery in the middle (version-1
metadata only) make sure a full recovery (not just as guided by
bitmaps) does get done.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error
to higher levels. While this should be sufficient, it is safer to explicitly
set the error code as well - less room for confusion.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are some vestiges of old code that was used for bypassing the stripe
cache on reads in raid5.c. This was never updated after the change from
buffer_heads to bios, but was left as a reminder.That functionality has nowe been implemented in a completely different way, so
the old code can go.Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The autorun code is only used if this module is built into the static
kernel image. Adjust #ifdefs accordingly.Signed-off-by: Jeff Garzik
Acked-by: NeilBrown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
stripe_to_pdidx finds the index of the parity disk for a given stripe. It
assumes raid5 in that it uses "disks-1" to determine the number of data disks.This is incorrect for raid6 but fortunately the two usages cancel each other
out. The only way that 'data_disks' affects the calculation of pd_idx in
raid5_compute_sector is when it is divided into the sector number. But as
that sector number is calculated by multiplying in the wrong value of
'data_disks' the division produces the right value.So it is innocuous but needs to be fixed.
Also change the calculation of raid_disks in compute_blocknr to make it
more obviously correct (it seems at first to always use disks-1 too).Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Call the chunk_aligned_read where appropriate.
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If a bypass-the-cache read fails, we simply try again through the cache. If
it fails again it will trigger normal recovery precedures.update 1:
From: NeilBrown
1/
chunk_aligned_read and retry_aligned_read assume that
data_disks == raid_disks - 1
which is not true for raid6.
So when an aligned read request bypasses the cache, we can get the wrong data.2/ The cloned bio is being used-after-free in raid5_align_endio
(to test BIO_UPTODATE).3/ We forgot to add rdev->data_offset when submitting
a bio for aligned-read4/ clone_bio calls blk_recount_segments and then we change bi_bdev,
so we need to invalidate the segment counts.5/ We don't de-reference the rdev when the read completes.
This means we need to record the rdev to so it is still
available in the end_io routine. Fortunately
bi_next in the original bio is unused at this point so
we can stuff it in there.6/ We leak a cloned bio if the target rdev is not usable.
From: NeilBrown
update 2:
1/ When aligned requests fail (read error) they need to be retried
via the normal method (stripe cache). As we cannot be sure that
we can process a single read in one go (we may not be able to
allocate all the stripes needed) we store a bio-being-retried
and a list of bioes-that-still-need-to-be-retried.
When find a bio that needs to be retried, we should add it to
the list, not to single-bio...2/ We were never incrementing 'scnt' when resubmitting failed
aligned requests.[akpm@osdl.org: build fix]
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Neil Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds