Eric Lee / smarc-fsl-linux-kernel

23 Dec, 2011

13 commits

3a6de2924 md/raid5: Mark device want_replacement when we see a write error.

Now that WantReplacement drives are replaced cleanly, mark a drive
as WantReplacement when we see a write error. It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:54 +0800
7bfec5f35 md/raid5: If there is a spare and a want_replacement device, start replacement.

When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.

This requires that common md code attempt hot_add even when the array
is not formally degraded.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:53 +0800
17045f52a md/raid5: recognise replacements when assembling array.

If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:53 +0800
dd054fce8 md/raid5: handle activation of replacement device when recovery completes.

When recovery completes - as reported by a call to ->spare_active,
we clear In_sync on the original and set it on the replacement.

Then when the original gets removed we move the replacement from
'replacement' to 'rdev'.

This could race with other code that is looking at these pointers,
so we use memory barriers and careful ordering to ensure that
a reader might see one device twice, but never no devices.
Then the readers guard against using both devices, which could
only happen when writing.
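
The ordering described above can be sketched in user space with C11 atomics. This is only an illustration: struct slot, promote_replacement and read_slot are hypothetical names, and the release/acquire operations stand in for the kernel's memory barriers. The writer publishes the device in 'rdev' before clearing 'replacement'; the reader loads them in the opposite order and collapses a duplicate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the md structures. */
struct dev { int id; };

struct slot {
    _Atomic(struct dev *) rdev;         /* original device */
    _Atomic(struct dev *) replacement;  /* replacement device */
};

struct pair { struct dev *rdev; struct dev *replacement; };

/* Writer: move the replacement into 'rdev' after the original is gone. */
static void promote_replacement(struct slot *s)
{
    struct dev *r = atomic_load_explicit(&s->replacement,
                                         memory_order_relaxed);
    if (r == NULL)
        return;
    /* Publish in 'rdev' first: both pointers now name the same device. */
    atomic_store_explicit(&s->rdev, r, memory_order_release);
    /* Only then clear 'replacement', so the slot is never seen empty. */
    atomic_store_explicit(&s->replacement, (struct dev *)NULL,
                          memory_order_release);
}

/* Reader: load 'replacement' before 'rdev' (the reverse of the writer). */
static struct pair read_slot(struct slot *s)
{
    struct pair p;
    p.replacement = atomic_load_explicit(&s->replacement,
                                         memory_order_acquire);
    p.rdev = atomic_load_explicit(&s->rdev, memory_order_acquire);
    /* Guard against using one device twice, e.g. writing to it twice. */
    if (p.rdev == p.replacement)
        p.replacement = NULL;
    return p;
}
```

If the reader observes 'replacement' as NULL, the acquire load guarantees it also observes the earlier store to 'rdev', which is the "never no devices" property.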

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:53 +0800
9a3e1101b md/raid5: detect and handle replacements during recovery.

During recovery we want to write to the replacement but not
the original. So we have two new flags
- R5_NeedReplace if this stripe has a replacement that needs to
be written at some stage
- R5_WantReplace if NeedReplace, and the data is available, and
a 'sync' has been requested on this stripe.

We also distinguish between 'sync and replace' which need to read
all other devices, and 'replace' which only needs to read the
devices being replaced.

Note that during resync we always write to any replacement device.
It might not need to be written to, but as we don't read to compare,
we have to write to be sure.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:53 +0800
977df3625 md/raid5: writes should get directed to replacement as well as original.

When writing, we need to submit two writes, one to the original, and
one to the replacement - if there is a replacement.

If the write to the replacement results in a write error, we just fail
the device. We only try to record write errors to the original.

When writing for recovery, we shouldn't write to the original. This
will be addressed in a subsequent patch that generally addresses
recovery.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:53 +0800
657e3e4d8 md/raid5: allow removal for failed replacement devices.

Enhance raid5_remove_disk to be able to remove ->replacement
as well as ->rdev.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:52 +0800
14a75d3e0 md/raid5: preferentially read from replacement device if possible.

If a replacement device is present and has been recovered far enough,
then use it for reading into the stripe cache.

If we get an error we don't try to repair it, we just fail the device.
A replacement device that gives errors does not sound sensible.

This requires removing the setting of R5_ReadError when we get
a read error during a read that bypasses the cache. It was probably
a bad idea anyway as we don't know that every block in the read
caused an error, and it could cause ReadError to be set for the
replacement device, which is bad.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:52 +0800
995c4275a md/raid5: remove redundant bio initialisations.

We currently initialise some fields of a bio when preparing a
stripe_head, and again just before submitting the request.

Remove the duplication by only setting the fields that lower level
devices don't touch in raid5_build_block, and only set the changeable
fields in ops_run_io.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:52 +0800
671488cc2 md/raid5: allow each slot to have an extra replacement device

Just enhance data structures to record a second device per slot to be
used as a 'replacement' device, replacing the original.
We also have a second bio in each slot in each stripe_head. This will
only be used when writing to the array - we need to write to both the
original and the replacement at the same time, so will need two bios.

For now, only try using the replacement drive for aligned-reads.
In this case, we prefer the replacement if it has been recovered far
enough, otherwise use the original.
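
As a rough sketch of the selection rule for aligned reads (not the kernel code): the structures and the names dev_readable and pick_read_dev are hypothetical, and the Faulty check is an added assumption. A device can serve the read if it is fully in sync or has been recovered past the end of the requested range; the replacement is tried first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the md structures. */
struct dev {
    bool in_sync;              /* fully recovered */
    bool faulty;               /* assumption: failed devices are skipped */
    uint64_t recovery_offset;  /* sectors below this have been recovered */
};

struct slot {
    struct dev *rdev;          /* original device, may be NULL */
    struct dev *replacement;   /* replacement device, may be NULL */
};

/* True if every sector below end_sector can be read from d. */
static bool dev_readable(const struct dev *d, uint64_t end_sector)
{
    if (d == NULL || d->faulty)
        return false;
    return d->in_sync || d->recovery_offset >= end_sector;
}

/* Prefer the replacement if it has recovered far enough,
 * otherwise fall back to the original. NULL means no aligned read. */
static struct dev *pick_read_dev(const struct slot *s, uint64_t end_sector)
{
    if (dev_readable(s->replacement, end_sector))
        return s->replacement;
    if (dev_readable(s->rdev, end_sector))
        return s->rdev;
    return NULL;
}
```

When pick_read_dev returns NULL, the caller would fall back to reading through the stripe cache.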

This includes a small enhancement. Previously we would only do
aligned reads if the target device was fully recovered. Now we also
do them if it has recovered far enough.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:52 +0800
b8321b68d md: change hot_remove_disk to take an rdev rather than a number.

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:51 +0800
908f4fbd2 md/raid5: be more thorough in calculating 'degraded' value.

When an array is being reshaped to change the number of devices,
the two halves can be differently degraded. e.g. one could be
missing a device and the other not.

So we need to be more careful about calculating the 'degraded'
attribute.

Instead of just inc/dec at appropriate times, perform a full
re-calculation examining both possible cases. This doesn't happen
often so it is not a big cost, and we already have most of the code to
do it.
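
A minimal sketch of such a full re-calculation, under simplifying assumptions: device state is reduced to one boolean per slot, the old and new geometries differ only in device count, and reporting the worse of the two counts is an assumption made for illustration. The names count_degraded and calc_degraded are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count the slots without a working device among the first ndisks. */
static int count_degraded(const bool *working, int ndisks)
{
    int degraded = 0;
    for (int i = 0; i < ndisks; i++)
        if (!working[i])
            degraded++;
    return degraded;
}

/* During a reshape the old layout spans previous_raid_disks slots and
 * the new one spans raid_disks slots. Examine both cases and report
 * the worse one (assumption for this sketch). */
static int calc_degraded(const bool *working,
                         int previous_raid_disks, int raid_disks)
{
    int d_old = count_degraded(working, previous_raid_disks);
    int d_new = count_degraded(working, raid_disks);
    return d_old > d_new ? d_old : d_new;
}
```

For example, growing from 3 to 4 devices with the fourth slot still empty gives 0 missing devices in the old half and 1 in the new half, so the array is reported as degraded by 1.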

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:50 +0800
30d7a4836 md/raid5: ensure correct assessment of drives during degraded reshape.

While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync. In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.

Reported-by: Adam Kwolek
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 06:57:00 +0800

09 Dec, 2011

1 commit

5d8c71f9e md: raid5 crash during degradation

NULL pointer access causes crash in raid5 module.

Signed-off-by: Adam Kwolek
Signed-off-by: NeilBrown

Adam Kwolek
2011-12-09 11:26:11 +0800

08 Dec, 2011

1 commit

9283d8c5a md/raid5: never wait for bad-block acks on failed device.

Once a device is failed we really want to completely ignore it.
It should go away soon anyway.

In particular the presence of bad blocks on it should not cause us to
block as we won't be trying to write there anyway.

So as soon as we can check if a device is Faulty, do so and pretend
that it is already gone if it is Faulty.

Signed-off-by: NeilBrown

NeilBrown
2011-12-08 13:27:57 +0800

08 Nov, 2011

2 commits

257a4b42a md/raid5: STRIPE_ACTIVE has lock semantics, add barriers

All updates that occur under STRIPE_ACTIVE should be globally visible
when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but
clear_bit() does not.
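
The lock-like behaviour can be illustrated in user space with a C11 atomic_flag. This is an analogy, not the kernel change: struct stripe and both helpers are hypothetical, and acquire/release ordering stands in for the kernel barriers. Setting the flag acts as "lock"; clearing it with release ordering acts as "unlock", so updates made while the flag was held are visible to whoever takes it next.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a stripe_head. */
struct stripe {
    atomic_flag active;  /* plays the role of STRIPE_ACTIVE */
    int state;           /* data that is only updated while active */
};

/* "Lock": returns true if this caller set the flag. Acquire ordering
 * mirrors the barrier implied by the kernel's test_and_set_bit(). */
static bool stripe_try_activate(struct stripe *sh)
{
    return !atomic_flag_test_and_set_explicit(&sh->active,
                                              memory_order_acquire);
}

/* "Unlock": release ordering makes earlier writes to sh->state visible
 * before the flag is seen as clear. The kernel's clear_bit() gives no
 * such ordering by itself, which is why explicit barriers are needed. */
static void stripe_deactivate(struct stripe *sh)
{
    atomic_flag_clear_explicit(&sh->active, memory_order_release);
}
```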

This is suitable for 3.1-stable.

Signed-off-by: Dan Williams
Signed-off-by: NeilBrown
Cc: stable@kernel.org

Dan Williams
2011-11-08 13:22:06 +0800
9a3f530f3 md/raid5: abort any pending parity operations when array fails.

When the number of failed devices exceeds the allowed number
we must abort any active parity operations (checks or updates) as they
are no longer meaningful, and can lead to a BUG_ON in
handle_parity_checks6.

This bug was introduced by commit 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
in 2.6.29.

Reported-by: Manish Katiyar
Tested-by: Manish Katiyar
Acked-by: Dan Williams
Signed-off-by: NeilBrown
Cc: stable@kernel.org

NeilBrown
2011-11-08 13:22:01 +0800

07 Nov, 2011

1 commit

32aaeffbd Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include
net: sch_generic remove redundant use of
net: inet_timewait_sock doesnt need
...

Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h

Linus Torvalds
2011-11-07 11:44:47 +0800

05 Nov, 2011

1 commit

b4fdcb02f Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block

* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c

Linus Torvalds
2011-11-05 08:06:58 +0800

01 Nov, 2011

1 commit

056075c76 md: Add module.h to all files using it implicitly

A pending cleanup will mean that module.h won't be implicitly
everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for module.h explicitly in advance.

Signed-off-by: Paul Gortmaker

Paul Gortmaker
2011-11-01 07:31:18 +0800

26 Oct, 2011

2 commits

d890fa2b0 md: Fix some bugs in recovery_disabled handling.

In 3.0 we changed the way recovery_disabled was handled so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
1/ one place in raid1 was missed and still sets to '1'.
2/ We didn't explicitly set the conf-> value at array creation
time.
It defaulted to '0' just like the mddev value does so they
could appear equal and thus disable recovery.
This did not affect normal 'md' as it calls bind_rdev_to_array
which changes the mddev value. However the dmraid interface
doesn't call this and so doesn't change ->recovery_disabled; so at
array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less than the mddev value, so
they will only be the same when explicitly set that way.
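
The comparison scheme can be sketched as follows. All structure and function names here are hypothetical, and modelling the change made when a device is bound as a simple increment is an assumption. Recovery counts as disabled only while the two values are equal, so initialising the 'conf' value to one less than the mddev value keeps them apart until recovery is disabled on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the md structures. */
struct array_state { int recovery_disabled; };  /* the mddev-> value */
struct pers_conf   { int recovery_disabled; };  /* the conf-> value */

/* Recovery is disabled only while the two values match. */
static bool recovery_is_disabled(const struct array_state *mddev,
                                 const struct pers_conf *conf)
{
    return mddev->recovery_disabled == conf->recovery_disabled;
}

/* At array creation: start one below, so the values differ. */
static void conf_init(struct pers_conf *conf,
                      const struct array_state *mddev)
{
    conf->recovery_disabled = mddev->recovery_disabled - 1;
}

/* Explicitly disable recovery by making the values equal. */
static void disable_recovery(struct pers_conf *conf,
                             const struct array_state *mddev)
{
    conf->recovery_disabled = mddev->recovery_disabled;
}

/* Assumption: binding a device changes the mddev value by incrementing
 * it, which re-enables recovery. */
static void device_bound(struct array_state *mddev)
{
    mddev->recovery_disabled++;
}
```

With both values left at their default of 0 and no call to conf_init, recovery_is_disabled would return true at array start, which is the second problem described above.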

Reported-by: Jonathan Brassow
Signed-off-by: NeilBrown

NeilBrown
2011-10-26 08:54:39 +0800
355840e7a md/raid5: fix bug that could result in reads from a failed device.

This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9
which was in 2.6.36.

There is a small window of time between when a device fails and when
it is removed from the array. During this time we might still read
from it, but we won't write to it - so it is possible that we could
read stale data.

We didn't need the test of 'Faulty' before because the test on
In_sync is sufficient. Since we started allowing reads from the early
part of non-In_sync devices we need a test on Faulty too.

This is suitable for any kernel from 2.6.36 onwards, though the patch
might need a bit of tweaking in 3.0 and earlier.

Cc: stable@kernel.org
Signed-off-by: NeilBrown

NeilBrown
2011-10-26 07:31:04 +0800

19 Oct, 2011

1 commit

5c04b426f Merge branch 'v3.1-rc10' into for-3.2/core

Conflicts:
block/blk-core.c
include/linux/blkdev.h

Signed-off-by: Jens Axboe

Jens Axboe
2011-10-19 20:30:42 +0800

11 Oct, 2011

5 commits

84fc4b56d md: rename "mdk_personality" to "md_personality"

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:49:58 +0800
d1688a6d5 md/raid5: typedef removal: raid5_conf_t -> struct r5conf

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:49:52 +0800
e373ab109 md/raid0: typedef removal: raid0_conf_t -> struct r0conf

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:48:59 +0800
fd01b88c7 md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:47:53 +0800
3cb030020 md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:45:26 +0800

07 Oct, 2011

3 commits

bdc04e6b1 md: remove some old DEBUGging code.

This code is not really helpful and is hard to maintain, so just
discard it.

Signed-off-by: NeilBrown

NeilBrown
2011-10-07 11:23:04 +0800
db298e194 md/raid5: convert to macros into inline functions.

More type-safety. Easier to read.

Signed-off-by: NeilBrown

NeilBrown
2011-10-07 11:23:00 +0800
e4f869d9d md/raid5: remove pointless NULL test.

In the 'abort' branch of run(), 'conf' cannot possibly be NULL,
so remove the test.

Reported-by: Zdenek Kabelac
Signed-off-by: NeilBrown

NeilBrown
2011-10-07 11:22:49 +0800

21 Sep, 2011

1 commit

01f96c0a9 md: Avoid waking up a thread after it has been freed.

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
without subsequently clearing ->thread. A subsequent call
to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
disappearing either by:
- holding the ->mutex
- having an active request, so something else must be keeping
the array active.
However mddev_unlock calls md_wakeup_thread after dropping the
mutex and without any certainty of an active request, so the
->thread could theoretically disappear.
So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.
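
The shape of that change can be sketched in user space. This is a toy model, not the md code: the names are hypothetical, the spinlock is a simple atomic_flag loop, and the wake-up helper also takes the address of the pointer purely for illustration. Passing a pointer to the thread pointer lets the unregister path clear it under the lock, so a later wake-up sees NULL instead of freed memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an md thread. */
struct thread { int wakeups; };

/* Toy spinlock protecting the thread pointer. */
static atomic_flag thread_lock = ATOMIC_FLAG_INIT;

static void lock(void)
{
    while (atomic_flag_test_and_set_explicit(&thread_lock,
                                             memory_order_acquire))
        ;  /* spin until the holder releases the lock */
}

static void unlock(void)
{
    atomic_flag_clear_explicit(&thread_lock, memory_order_release);
}

/* Takes the address of the pointer so it is always cleared,
 * under the lock, before the thread is freed. */
static void unregister_thread(struct thread **threadp)
{
    lock();
    struct thread *t = *threadp;
    *threadp = NULL;
    unlock();
    free(t);  /* free(NULL) is a no-op, so repeated calls are safe */
}

/* Returns 1 if a thread was woken, 0 if it had already gone away. */
static int wakeup_thread(struct thread **threadp)
{
    int woke = 0;
    lock();
    if (*threadp != NULL) {
        (*threadp)->wakeups++;
        woke = 1;
    }
    unlock();
    return woke;
}
```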

Reported-by: "Moshe Melnikov"
Signed-off-by: NeilBrown
cc: stable@kernel.org

NeilBrown
2011-09-21 13:30:20 +0800

12 Sep, 2011

1 commit

5a7bbad27 block: remove support for bio remapping from ->make_request

There is very little benefit in letting a ->make_request
instance update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig
Acked-by: NeilBrown
Signed-off-by: Jens Axboe

Christoph Hellwig
2011-09-12 18:12:01 +0800

31 Aug, 2011

1 commit

43220aa0f md/raid5: fix a hang on device failure.

Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
cannot work with internal metadata as it is the raid5d thread which
will clear the blocked flag.
This wasn't a problem in 3.0 and earlier as we only set the blocked
flag when external metadata was used then.
However we now set it always, so we need to be more careful.

Signed-off-by: NeilBrown

NeilBrown
2011-08-31 10:49:14 +0800

28 Jul, 2011

6 commits

b84db560e md/raid5: Clear bad blocks on successful write.

On a successful write to a known bad block, flag the sh
so that raid5d can remove the known bad block from the list.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:39:23 +0800
73e92e51b md/raid5. Don't write to known bad block on doubtful devices.

If a device has seen write errors, don't write to any known
bad blocks on that device.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:39:22 +0800
bc2607f39 md/raid5: write errors should be recorded as bad blocks if possible.

When a write error is detected, don't mark the device as failed
immediately but rather record the fact for handle_stripe to deal with.

Handle_stripe then attempts to record a bad block. Only if that fails
does the device get marked as faulty.
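
A minimal sketch of that fallback, with a deliberately tiny fixed-size table standing in for the bad block log. The names MAX_BAD_BLOCKS, record_bad_block and handle_write_error are hypothetical. The device is marked faulty only when the bad block cannot be recorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BAD_BLOCKS 4  /* hypothetical, tiny on purpose */

/* Hypothetical, simplified device with a fixed-size bad block table. */
struct dev {
    bool faulty;
    int nbad;
    uint64_t bad[MAX_BAD_BLOCKS];
};

/* Try to record a bad block; returns false if the table is full. */
static bool record_bad_block(struct dev *d, uint64_t sector)
{
    for (int i = 0; i < d->nbad; i++)
        if (d->bad[i] == sector)
            return true;  /* already recorded */
    if (d->nbad == MAX_BAD_BLOCKS)
        return false;
    d->bad[d->nbad++] = sector;
    return true;
}

/* On a write error, prefer recording a bad block; fail the device
 * only if recording is not possible. */
static void handle_write_error(struct dev *d, uint64_t sector)
{
    if (!record_bad_block(d, sector))
        d->faulty = true;
}
```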

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:39:22 +0800
7f0da59bd md/raid5: use bad-block log to improve handling of uncorrectable read errors.

If we get an uncorrectable read error - record a bad block rather than
failing the device.
And if these errors (which may be due to known bad blocks) cause
recovery to be impossible, record a bad block on the recovering
devices, or abort the recovery.

As we might abort a recovery without failing a device we need to teach
RAID5 about recovery_disabled handling.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:39:22 +0800
31c176ecd md/raid5: avoid reading from known bad blocks.

There are two times that we might read in raid5:
1/ when a read request fits within a chunk on a single
working device.
In this case, if there is any bad block in the range of
the read, we simply fail the cache-bypass read and
perform the read through the stripe cache.

2/ when reading into the stripe cache. In this case we
mark as failed any device which has a bad block in that
strip (1 page wide).
Note that we will both avoid reading and avoid writing.
This is correct (as we will never read from the block, there
is no point writing), but not optimal (as writing could 'fix'
the error) - that will be addressed later.

If we have not seen any write errors on the device yet, we treat a bad
block like a recent read error. This will encourage an attempt to fix
the read error which will either generate a write error, or will
ensure good data is stored there. We don't yet forget the bad block
in that case. That comes later.

Now that we honour bad blocks when reading we can allow devices with
bad blocks into the array.
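
The check behind the first case can be sketched as a range-overlap test. This is an illustration only: struct bad_range, range_has_bad and can_bypass_cache are hypothetical names, and the bad block list is modelled as a plain array of (start, length) ranges in sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bad block entry: [start, start + len) in sectors. */
struct bad_range {
    uint64_t start;
    uint64_t len;
};

/* True if [sector, sector + nr_sectors) overlaps any bad range. */
static bool range_has_bad(const struct bad_range *bb, size_t n,
                          uint64_t sector, uint64_t nr_sectors)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t bad_end = bb[i].start + bb[i].len;
        if (bb[i].start < sector + nr_sectors && sector < bad_end)
            return true;
    }
    return false;
}

/* Case 1: a cache-bypass read is only attempted when the whole range
 * is free of known bad blocks; otherwise the read goes through the
 * stripe cache. */
static bool can_bypass_cache(const struct bad_range *bb, size_t n,
                             uint64_t sector, uint64_t nr_sectors)
{
    return !range_has_bad(bb, n, sector, nr_sectors);
}
```

The same overlap test, applied to the one-page strip of a device, would decide whether that device is treated as failed for a given stripe in the second case.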

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:39:22 +0800
de393cdea md: make it easier to wait for bad blocks to be acknowledged.

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:31:48 +0800