11 Jan, 2012
1 commit
-
We normally try to avoid reading from write-mostly devices, but when
we do we really have to check for bad blocks and be sure not to
try reading them.
With the current code, best_good_sectors might not get set and that
causes zero-length read requests to be sent down which is very
confusing.
This bug was introduced in commit d2eb35acfdccbe2 and so the patch
is suitable for 3.1.x and 3.2.x.
Reported-and-tested-by: Michał Mirosław
Reported-and-tested-by: Art -kwaak- van Breemen
Signed-off-by: NeilBrown
Cc: stable@vger.kernel.org
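
The failure mode described in the commit above (a device is chosen but the usable length is never filled in) can be sketched outside the kernel roughly as follows. This is a simplified illustration, not the kernel's read_balance: pick_read_dev, its parameters, and the precomputed good[] array are hypothetical names introduced here. The point is only that the write-mostly fallback reports its good-sector count the same way as the normal path.

```c
#include <assert.h>

/*
 * Hypothetical sketch: choose a device to read from.
 * write_mostly[d] is non-zero for devices we prefer not to read;
 * good[d] is the number of readable sectors at the start of the
 * request on device d (0 if the request starts in a bad block).
 * Returns the device index, or -1 if nothing can serve the read.
 * *best_good is written on every path.
 */
static int pick_read_dev(const int *write_mostly, const long *good,
                         int ndev, long *best_good)
{
    int best = -1;

    *best_good = 0;
    /* pass 0: ordinary devices; pass 1: write-mostly devices as fallback */
    for (int pass = 0; pass < 2 && best < 0; pass++) {
        for (int d = 0; d < ndev; d++) {
            if (!!write_mostly[d] != pass)
                continue;
            if (good[d] > *best_good) {
                best = d;
                *best_good = good[d];
            }
        }
    }
    /* a chosen device always comes with a non-zero length */
    assert(best < 0 || *best_good > 0);
    return best;
}
```

A caller would treat a return of -1 as "no device can serve this read" and would never submit a request of length zero.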
23 Dec, 2011
8 commits
-
Now that WantReplacement drives are replaced cleanly, mark a drive
as want_replacement when we see a write error. It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.
Signed-off-by: NeilBrown
-
When attempting to add a spare to a RAID1 array, also consider
adding it as a replacement for a want_replacement device.
Signed-off-by: NeilBrown
-
If a Replacement is seen, file it as such.
If we see two replacements (or two normal devices) for the one slot,
abort.
Signed-off-by: NeilBrown
-
When recovery completes ->spare_active is called.
This checks if the replacement is ready and if so it fails
the original.
Signed-off-by: NeilBrown
-
Replacement devices are stored at a different offset, so look
there too.
Signed-off-by: NeilBrown
-
In RAID1, a replacement is much like a normal device, so we just
double the size of the relevant arrays and look at all possible
devices for reads and writes.
This means that the array looks like it is now double the size in some
way - we need to be careful about that.
In particular, when checking if the array is still degraded while
creating a recovery request we need to only consider the first 'half',
i.e. the real (non-replacement) devices.
Signed-off-by: NeilBrown
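
The doubled-table layout described above can be sketched as follows. This is an illustration under stated assumptions, not the driver's code: struct mirror_slot, replacement_index and count_degraded are hypothetical names, and the only facts taken from the commit message are that replacements live in a second half of the table and that the degraded check looks at the first half only.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for one entry of the mirror table. */
struct mirror_slot {
    int present;   /* a device is attached to this slot */
    int in_sync;   /* the device holds current data */
};

/* Index of the replacement for 'slot' in a table of 2 * raid_disks entries. */
static int replacement_index(int slot, int raid_disks)
{
    assert(slot >= 0 && slot < raid_disks);
    return slot + raid_disks;
}

/* Count missing or out-of-sync devices, looking only at the first half. */
static int count_degraded(const struct mirror_slot *tbl, int raid_disks)
{
    int degraded = 0;

    for (int i = 0; i < raid_disks; i++)
        if (!tbl[i].present || !tbl[i].in_sync)
            degraded++;
    return degraded;
}
```

With raid_disks = 2, a replacement for slot 1 sits at index 3, and a replacement that is still being rebuilt does not make the array look less degraded.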
-
In general mddev->raid_disks can change unexpectedly while
conf->raid_disks will only change in a very controlled way. So change
some uses of one to the other.
The use of mddev->raid_disks will not actually cause problems but
this way is more consistent and safer in the long term.
Signed-off-by: NeilBrown
-
Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.
Reviewed-by: Dan Williams
Signed-off-by: NeilBrown
07 Nov, 2011
1 commit
-
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include
net: sch_generic remove redundant use of
net: inet_timewait_sock doesnt need
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
05 Nov, 2011
1 commit
-
* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...
Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c
01 Nov, 2011
1 commit
-
A pending cleanup will mean that module.h won't be implicitly
included everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for it explicitly in advance.
Signed-off-by: Paul Gortmaker
26 Oct, 2011
1 commit
-
In 3.0 we changed the way recovery_disabled was handled so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
1/ one place in raid1 was missed and still sets to '1'.
2/ We didn't explicitly set the conf-> value at array creation
time.
It defaulted to '0' just like the mddev value does so they
could appear equal and thus disable recovery.
This did not affect normal 'md' as it calls bind_rdev_to_array
which changes the mddev value. However the dmraid interface
doesn't call this and so doesn't change ->recovery_disabled; so at
array start all recovery is incorrectly disabled.
So initialise the 'conf' value to one less than the mddev value, so
they will only be the same when explicitly set that way.
Reported-by: Jonathan Brassow
Signed-off-by: NeilBrown
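
The comparison scheme described in this commit can be sketched as below. The struct and function names are hypothetical stand-ins, not the kernel's definitions; the sketch only shows why two zero-initialised values look "equal" (the bug) and how starting the conf value at one less than the mddev value avoids that.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two structures holding the values. */
struct mddev_sketch { int recovery_disabled; };
struct conf_sketch  { int recovery_disabled; };

/* At array creation: make the two values differ on purpose. */
static void conf_init(struct conf_sketch *conf, const struct mddev_sketch *mddev)
{
    assert(conf && mddev);
    conf->recovery_disabled = mddev->recovery_disabled - 1;
}

/* Recovery is disabled only by explicitly making the values equal. */
static void disable_recovery(struct conf_sketch *conf,
                             const struct mddev_sketch *mddev)
{
    conf->recovery_disabled = mddev->recovery_disabled;
}

static int recovery_is_disabled(const struct conf_sketch *conf,
                                const struct mddev_sketch *mddev)
{
    return conf->recovery_disabled == mddev->recovery_disabled;
}
```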
24 Oct, 2011
1 commit
-
bio originally has the functionality to set the complete cpu, but
it is broken.
Christoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken. The only thing keeping
it serves is creating more confusion and possibly more bugs."
And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".
So this patch tries to remove all the work of controlling complete cpu
from a bio.
Cc: Shaohua Li
Cc: Christoph Hellwig
Signed-off-by: Tao Ma
Signed-off-by: Jens Axboe
19 Oct, 2011
1 commit
-
Conflicts:
block/blk-core.c
include/linux/blkdev.h
Signed-off-by: Jens Axboe
11 Oct, 2011
7 commits
-
RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread. This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.
However writeback requires devices to appear to be congested after a
while so it can make some guesstimate of throughput. The infinite
queue defeats that (note that RAID5 already has a finite queue so
it doesn't suffer from this problem).
So impose a limit on the number of pending write requests. By default
it is 1024 which seems to be generally suitable. Make it configurable
via module option just in case someone finds a regression.
Signed-off-by: NeilBrown
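
A bounded pending-write queue of the kind described above might look like the following sketch. All names here (struct write_queue, queue_write, flush_pending, DEFAULT_MAX_QUEUED) are hypothetical; only the default of 1024 comes from the commit message. The real driver makes the submitter wait rather than returning a status, which this sketch models with a return value of 0.

```c
#include <assert.h>

#define DEFAULT_MAX_QUEUED 1024   /* default taken from the commit message */

struct write_queue {
    int pending;      /* writes queued but not yet handed to the devices */
    int max_queued;   /* limit on 'pending' */
};

/* Report congestion once the queue is full, so writeback sees back-pressure. */
static int queue_congested(const struct write_queue *q)
{
    return q->pending >= q->max_queued;
}

/* Try to queue one write; returns 1 if queued, 0 if the caller must wait. */
static int queue_write(struct write_queue *q)
{
    if (queue_congested(q))
        return 0;
    q->pending++;
    return 1;
}

/* The worker thread takes everything queued so far in one batch. */
static int flush_pending(struct write_queue *q)
{
    int batch = q->pending;

    assert(batch <= q->max_queued);
    q->pending = 0;
    return batch;
}
```

Batching is preserved (one bitmap update can still cover many writes), but the queue can no longer grow without bound.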
-
"mdk" doesn't mean anything any more.
Signed-off-by: NeilBrown
-
Signed-off-by: NeilBrown
-
Signed-off-by: NeilBrown
-
Signed-off-by: NeilBrown
-
Having mddev_t and 'struct mddev_s' is ugly and not preferred
Signed-off-by: NeilBrown
-
The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.
Signed-off-by: NeilBrown
07 Oct, 2011
3 commits
-
Being able to dynamically enable these makes them much more useful.
Signed-off-by: NeilBrown
-
We know which device we just read from so we don't need to
search the bios to find out. Just use ->read_disk.
Signed-off-by: NeilBrown
-
When a normal-write or sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor that common
code out to a separate function.
Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown
21 Sep, 2011
1 commit
-
Two related problems:
1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.
2/ Most calls to md_wakeup_thread are protected against the thread
disappearing either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protections.
So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.
Reported-by: "Moshe Melnikov"
Signed-off-by: NeilBrown
cc: stable@kernel.org
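
The pattern this commit describes (unregister takes a pointer to the thread pointer, detaches it under a lock, and wake-up checks the pointer under the same lock) can be sketched in user-space C as below. This is not the md code: struct worker, unregister_worker, wake_worker and the C11 atomic_flag spinlock are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-array worker thread object. */
struct worker { int wakeups; };

static atomic_flag thread_lock = ATOMIC_FLAG_INIT;

static void lock_threads(void)
{
    while (atomic_flag_test_and_set_explicit(&thread_lock, memory_order_acquire))
        ;   /* spin until the flag is released */
}

static void unlock_threads(void)
{
    atomic_flag_clear_explicit(&thread_lock, memory_order_release);
}

/* Takes a pointer to the thread pointer so it can always clear it. */
static void unregister_worker(struct worker **wp)
{
    struct worker *w;

    assert(wp != NULL);
    lock_threads();
    w = *wp;
    *wp = NULL;          /* later wake-ups see no thread */
    unlock_threads();
    free(w);             /* free(NULL) is a no-op */
}

/* Wake the worker if it still exists; returns 1 if it was woken. */
static int wake_worker(struct worker **wp)
{
    int woke = 0;

    lock_threads();
    if (*wp) {
        (*wp)->wakeups++;
        woke = 1;
    }
    unlock_threads();
    return woke;
}
```

Because the caller's pointer is cleared inside the unregister call, an error path cannot forget to do it, and a late wake-up finds NULL instead of a stale pointer.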
12 Sep, 2011
1 commit
-
There is very little benefit in letting a ->make_request
instance update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.
Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.
Signed-off-by: Christoph Hellwig
Acked-by: NeilBrown
Signed-off-by: Jens Axboe
10 Sep, 2011
1 commit
-
A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.
To detect if we need to submit another write request we test:
    if (sectors_handled < (bio->bi_size >> 9)) {
However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.
So move the **_write_done call until after the test against
bio->bi_size.
This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862
Reported-by: Bruno Wolff III
Tested-by: Bruno Wolff III
Signed-off-by: NeilBrown
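
The ordering fix described above (read bio->bi_size first, and only then signal completion) can be illustrated with a small model. struct bio_sketch, write_done and finish_piece are hypothetical; the "freed" flag stands in for the bio actually being released, so the sketch can check the ordering without touching freed memory.

```c
#include <assert.h>

/* Hypothetical stand-in for a bio with a reference count. */
struct bio_sketch {
    unsigned int size_bytes;  /* total request size in bytes */
    int refs;                 /* outstanding references */
    int freed;                /* set when the last reference is dropped */
};

/* Drop our reference; after this call the bio may be gone. */
static void write_done(struct bio_sketch *bio)
{
    if (--bio->refs == 0)
        bio->freed = 1;
}

/* Returns 1 if another write must be submitted for the remainder. */
static int finish_piece(struct bio_sketch *bio, unsigned int sectors_handled)
{
    int more;

    assert(!bio->freed);                              /* still ours here */
    more = sectors_handled < (bio->size_bytes >> 9);  /* test first */
    write_done(bio);                                  /* only now let go */
    return more;
}
```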
28 Jul, 2011
11 commits
-
raid1d is too big with several deep branches.
So separate them out into their own functions.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
If we cannot read a block from anywhere during recovery, there is
now a better approach than just giving up.
We can record a bad block on each device and keep going - being
careful not to clear the bad block when a write succeeds as it might -
it will be a write of incorrect data.
We have now reached the state where - for raid1 - we only call
md_error if md_set_badblocks has failed.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
If we find a bad block while writing as part of resync/recovery we
need to report that back to raid1d which must record the bad block,
or fail the device.
Similarly when fixing a read error, a further error should just
record a bad block if possible rather than failing the device.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.
As the write may well be many blocks, we try writing each
block individually and only log the ones which fail.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
When performing write-behind we allocate pages to store the data
during write.
Previously we just kept a list of pages. Now we keep a list of
bio_vec which includes offset and size.
This means that the r1bio has complete information to create a new
bio which will be needed for retrying after write errors.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.
This requires some delayed handling as the bad-block-list update has
to happen in process-context.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
-
It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.
If it hasn't we need to wait for the acknowledgement.
We support that using rdev->blocked_wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.
This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.
It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.
When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see above race).
We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.
Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.
The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.
Signed-off-by: NeilBrown
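
The rule for interpreting the 'faulty' flag, as stated at the end of this commit message, can be written as two small predicates. This is a sketch with hypothetical names (struct rdev_sketch, sysfs_faulty, really_faulty); it models the described behaviour and does not reproduce the sysfs code.

```c
#include <assert.h>

/* Hypothetical summary of the state user space can observe for one device. */
struct rdev_sketch {
    int faulty;        /* the device has actually failed */
    int unacked_bad;   /* bad blocks not yet acknowledged in metadata */
};

/* What the 'faulty' flag reports: failed, or has unacknowledged bad blocks. */
static int sysfs_faulty(const struct rdev_sketch *r)
{
    assert(r->unacked_bad >= 0);
    return r->faulty || r->unacked_bad > 0;
}

/* What bad-block-aware user space should conclude from the two inputs. */
static int really_faulty(const struct rdev_sketch *r)
{
    return sysfs_faulty(r) && r->unacked_bad == 0;
}
```

A tool that does not know about bad blocks sees 'faulty' and reacts conservatively; a tool that does know waits until the unacknowledged list is empty before treating the device as failed.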
-
When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.
Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.
Signed-off-by: NeilBrown
-
Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
1/ read_balance needs to check for bad blocks, and return not only
the chosen device, but also how many good blocks are available
there.
2/ fix_read_error needs to avoid trying to read from bad blocks.
3/ read submission must be ready to issue multiple reads to
different devices as different bad blocks on different devices
could mean that a single large read cannot be served by any one
device, but can still be served by the array.
This requires keeping count of the number of outstanding requests
per bio. This count is stored in 'bi_phys_segments'
4/ retrying a read needs to also be ready to submit a smaller read
and queue another request for the rest.
This does not yet handle bad blocks when reading to perform resync,
recovery, or check.
'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.
Signed-off-by: NeilBrown
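
Points 1/ and 3/ above (pick a device together with the number of good sectors it offers, then issue further reads for the rest) can be sketched as a splitting loop. This is a simplified model, not the raid1 code: each device is assumed to have at most one bad range, the choice is simply "longest good run", and the names struct mirror_bb, struct piece, good_from and split_read are hypothetical.

```c
#include <assert.h>

/* Hypothetical: one bad range per device; bad_len == 0 means none. */
struct mirror_bb { long bad_start; long bad_len; };

/* One sub-read: which device, where it starts, how many sectors. */
struct piece { int dev; long start; long len; };

/* Good sectors available on m starting at 'sector', capped at 'want'. */
static long good_from(const struct mirror_bb *m, long sector, long want)
{
    long gap;

    if (m->bad_len <= 0 || sector >= m->bad_start + m->bad_len)
        return want;                 /* no bad range at or after 'sector' */
    if (sector >= m->bad_start)
        return 0;                    /* request starts inside the bad range */
    gap = m->bad_start - sector;
    return gap < want ? gap : want;
}

/* Split [start, start+len) into pieces; returns the count, or -1 on failure. */
static int split_read(const struct mirror_bb *m, int ndev, long start,
                      long len, struct piece *out, int max_out)
{
    int n = 0;

    assert(len >= 0);
    while (len > 0) {
        int best = -1;
        long best_len = 0;

        for (int d = 0; d < ndev; d++) {
            long g = good_from(&m[d], start, len);
            if (g > best_len) {
                best = d;
                best_len = g;
            }
        }
        if (best < 0 || n == max_out)
            return -1;               /* unreadable everywhere, or no room */
        out[n].dev = best;
        out[n].start = start;
        out[n].len = best_len;
        n++;
        start += best_len;
        len -= best_len;
    }
    return n;
}
```

For example, with bad sectors 8-15 on device 0 and 0-3 on device 1, a 24-sector read starting at sector 0 becomes two pieces: sectors 0-7 from device 0 and sectors 8-23 from device 1. No single device can serve the whole read, but the array can.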
-
As no personality understands bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.
This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.
Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim
27 Jul, 2011
1 commit
-
If device-mapper creates a RAID1 array that includes devices to
be rebuilt, it will deref a NULL pointer when finished because
sysfs is not used by device-mapper instantiated RAID devices.
Signed-off-by: Jonathan Brassow
Signed-off-by: NeilBrown