Eric Lee / smarc-fsl-linux-kernel

11 Jan, 2012

1 commit

307729c8b md/raid1: perform bad-block tests for WriteMostly devices too.

We normally try to avoid reading from write-mostly devices, but when
we do we really have to check for bad blocks and be sure not to
try reading them.

With the current code, best_good_sectors might not get set, and that
causes zero-length read requests to be sent down, which is very
confusing.

This bug was introduced in commit d2eb35acfdccbe2 and so the patch
is suitable for 3.1.x and 3.2.x

Reported-and-tested-by: Michał Mirosław
Reported-and-tested-by: Art -kwaak- van Breemen
Signed-off-by: NeilBrown
Cc: stable@vger.kernel.org

NeilBrown
2012-01-11 05:35:17 +0800
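
The fix described in 307729c8b can be illustrated with a minimal sketch. It is not the kernel's read_balance(): the types, the one-bad-range-per-device model and the names good_sectors and pick_read_disk are assumptions made for illustration. The point it shows is that write-mostly devices go through the same bad-block test as normal ones, so the returned sector count is always set when a device is chosen.

```c
#include <assert.h>

/* Simplified model: at most one bad range per device (len == 0 means none). */
struct bad_range {
	long start;
	long len;
};

struct mirror {
	int write_mostly;
	struct bad_range bad;
};

/* Number of readable sectors at the start of [sector, sector + sectors). */
long good_sectors(const struct mirror *m, long sector, long sectors)
{
	long bad_end = m->bad.start + m->bad.len;

	if (m->bad.len == 0 || bad_end <= sector ||
	    m->bad.start >= sector + sectors)
		return sectors;		/* no overlap with the bad range */
	if (m->bad.start <= sector)
		return 0;		/* the first sector is already bad */
	return m->bad.start - sector;	/* good up to the bad range */
}

/*
 * Pick a device to read from.  Pass 0 looks at normal devices, pass 1 at
 * write-mostly ones; both passes run the same bad-block test.
 * Returns the device index, or -1 if no device can serve the first sector.
 * *max_sectors is always set, and is non-zero whenever a device is chosen.
 */
int pick_read_disk(const struct mirror *mirrors, int n,
		   long sector, long sectors, long *max_sectors)
{
	int best = -1;
	long best_good = 0;
	int pass, i;

	assert(sectors > 0);
	for (pass = 0; pass < 2 && best < 0; pass++) {
		for (i = 0; i < n; i++) {
			long good;

			if (!!mirrors[i].write_mostly != pass)
				continue;
			good = good_sectors(&mirrors[i], sector, sectors);
			if (good > best_good) {
				best = i;
				best_good = good;
			}
		}
	}
	*max_sectors = best_good;
	return best;
}
```

The real code also weighs head position and other factors when choosing a device; those are left out here.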

23 Dec, 2011

8 commits

19d671695 md/raid1: Mark device want_replacement when we see a write error.

Now that WantReplacement drives are replaced cleanly, mark a drive
as want_replacement when we see a write error. It might get failed soon so
the WantReplacement flag is irrelevant, but if the write error is recorded
in the bad block log, we still want to activate any spare that might
be available.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:57 +0800
7ef449d1e md/raid1: If there is a spare and a want_replacement device, start replacement.

When attempting to add a spare to a RAID1 array, also consider
adding it as a replacement for a want_replacement device.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:57 +0800
c19d57980 md/raid1: recognise replacements when assembling arrays.

If a Replacement is seen, file it as such.

If we see two replacements (or two normal devices) for the one slot,
abort.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:57 +0800
8c7a2c2bc md/raid1: handle activation of replacement device when recovery completes.

When recovery completes ->spare_active is called.
This checks if the replacement is ready and if so it fails
the original.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:57 +0800
b014f14c8 md/raid1: Allow a failed replacement device to be removed.

Replacement devices are stored at a different offset, so look
there too.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:56 +0800
8f19ccb2f md/raid1: Allocate spare to store replacement devices and their bios.

In RAID1, a replacement is much like a normal device, so we just
double the size of the relevant arrays and look at all possible
devices for reads and writes.

This means that the array looks like it is now double the size in some
way - we need to be careful about that.
In particular, when checking if the array is still degraded while
creating a recovery request we need to only consider the first 'half'
- i.e. the real (non-replacement) devices.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:56 +0800
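
The doubled-array layout from 8f19ccb2f can be sketched as follows. This is a simplified model, not the kernel's data structures: struct slot, replacement_index and array_is_degraded are hypothetical names, and placing the replacement for slot i at index i + raid_disks is an assumption consistent with the "first half" wording above.

```c
#include <assert.h>

/* Hypothetical per-slot state; the array holds 2 * raid_disks entries. */
struct slot {
	int present;
	int in_sync;
};

/* Assumed layout: the replacement for slot i lives in the second half. */
int replacement_index(int slot, int raid_disks)
{
	assert(slot >= 0 && slot < raid_disks);
	return slot + raid_disks;
}

/*
 * The array is degraded if any real device is missing or out of sync.
 * Only the first half is examined; replacements do not count.
 */
int array_is_degraded(const struct slot *slots, int raid_disks)
{
	int i;

	assert(raid_disks > 0);
	for (i = 0; i < raid_disks; i++)
		if (!slots[i].present || !slots[i].in_sync)
			return 1;
	return 0;
}
```

Looping to 2 * raid_disks here would wrongly report an array as degraded whenever a replacement slot is empty or still rebuilding.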
301946364 md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.

In general mddev->raid_disks can change unexpectedly while
conf->raid_disks will only change in a very controlled way. So change
some uses of one to the other.

The use of mddev->raid_disks will not actually cause problems but
this way is more consistent and safer in the long term.

Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:56 +0800
b8321b68d md: change hot_remove_disk to take an rdev rather than a number.

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement). So removing
a device based on the number won't work. So pass the actual device
handle instead.

Reviewed-by: Dan Williams
Signed-off-by: NeilBrown

NeilBrown
2011-12-23 07:17:51 +0800

07 Nov, 2011

1 commit

32aaeffbd Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux

* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include
net: sch_generic remove redundant use of
net: inet_timewait_sock doesnt need
...

Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h

Linus Torvalds
2011-11-07 11:44:47 +0800

05 Nov, 2011

1 commit

b4fdcb02f Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block

* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
block: don't call blk_drain_queue() if elevator is not up
blk-throttle: use queue_is_locked() instead of lockdep_is_held()
blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
blk-throttle: Free up policy node associated with deleted rule
block: warn if tag is greater than real_max_depth.
block: make gendisk hold a reference to its queue
blk-flush: move the queue kick into
blk-flush: fix invalid BUG_ON in blk_insert_flush
block: Remove the control of complete cpu from bio.
block: fix a typo in the blk-cgroup.h file
block: initialize the bounce pool if high memory may be added later
block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
block: drop @tsk from attempt_plug_merge() and explain sync rules
block: make get_request[_wait]() fail if queue is dead
block: reorganize throtl_get_tg() and blk_throtl_bio()
block: reorganize queue draining
block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
block: pass around REQ_* flags instead of broken down booleans during request alloc/free
block: move blk_throtl prototypes to block/blk.h
block: fix genhd refcounting in blkio_policy_parse_and_set()
...

Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
and making the request functions be of type "void" instead of "int" in
- drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
- drivers/staging/zram/zram_drv.c

Linus Torvalds
2011-11-05 08:06:58 +0800

01 Nov, 2011

1 commit

056075c76 md: Add module.h to all files using it implicitly

A pending cleanup will mean that module.h won't be implicitly
included everywhere anymore. Make sure the modular drivers in the md dir
are actually calling out for module.h explicitly in advance.

Signed-off-by: Paul Gortmaker

Paul Gortmaker
2011-11-01 07:31:18 +0800

26 Oct, 2011

1 commit

d890fa2b0 md: Fix some bugs in recovery_disabled handling.

In 3.0 we changed the way recovery_disabled was handled so that instead
of testing against zero, we test an mddev-> value against a conf->
value.
Two problems:
  1/ one place in raid1 was missed and still sets to '1'.
  2/ We didn't explicitly set the conf-> value at array creation
     time.
     It defaulted to '0' just like the mddev value does so they
     could appear equal and thus disable recovery.
     This did not affect normal 'md' as it calls bind_rdev_to_array
     which changes the mddev value. However the dmraid interface
     doesn't call this and so doesn't change ->recovery_disabled; so at
     array start all recovery is incorrectly disabled.

So initialise the 'conf' value to one less than the mddev value, so
they will only be the same when explicitly set that way.

Reported-by: Jonathan Brassow
Signed-off-by: NeilBrown

NeilBrown
2011-10-26 08:54:39 +0800
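
The comparison scheme from d890fa2b0 can be sketched with two counters. The struct and function names below (mddev_model, conf_model, conf_init and so on) are hypothetical stand-ins, not kernel code; the sketch only shows why initialising the conf value to one less than the mddev value avoids an accidental match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two structures that are compared. */
struct mddev_model {
	int recovery_disabled;
};

struct conf_model {
	int recovery_disabled;
};

/* Recovery is disabled only while the two values are equal. */
int recovery_is_disabled(const struct mddev_model *m,
			 const struct conf_model *c)
{
	return m->recovery_disabled == c->recovery_disabled;
}

/* At array creation, start one below the mddev value so they differ. */
void conf_init(struct conf_model *c, const struct mddev_model *m)
{
	assert(c != NULL && m != NULL);
	c->recovery_disabled = m->recovery_disabled - 1;
}

/* Disabling recovery is the only path that makes the values equal. */
void disable_recovery(struct conf_model *c, const struct mddev_model *m)
{
	c->recovery_disabled = m->recovery_disabled;
}
```

If both values were left at their default of 0, recovery_is_disabled() would return true from the start, which is the dmraid symptom the commit message describes.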

24 Oct, 2011

1 commit

9562ad9ab block: Remove the control of complete cpu from bio.

bio originally has the functionality to set the complete cpu, but
it is broken.

Christoph said that "This code is unused, and from the all the
discussions lately pretty obviously broken. The only thing keeping
it serves is creating more confusion and possibly more bugs."

And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
with leaving cpu control to the request based drivers, they are the
only ones that can toggle the setting anyway".

So this patch tries to remove all the work of controlling complete cpu
from a bio.

Cc: Shaohua Li
Cc: Christoph Hellwig
Signed-off-by: Tao Ma
Signed-off-by: Jens Axboe

Tao Ma
2011-10-24 22:11:30 +0800

19 Oct, 2011

1 commit

5c04b426f Merge branch 'v3.1-rc10' into for-3.2/core

Conflicts:
block/blk-core.c
include/linux/blkdev.h

Signed-off-by: Jens Axboe

Jens Axboe
2011-10-19 20:30:42 +0800

11 Oct, 2011

7 commits

34db0cd60 md: add proper write-congestion reporting to RAID1 and RAID10.

RAID1 and RAID10 handle write requests by queuing them for handling by
a separate thread. This is because when a write-intent-bitmap is
active we might need to update the bitmap first, so it is good to
queue a lot of writes, then do one big bitmap update for them all.

However writeback requires devices to appear to be congested after a
while so it can make some guesstimate of throughput. The infinite
queue defeats that (note that RAID5 already has a finite queue so
it doesn't suffer from this problem).

So impose a limit on the number of pending write requests. By default
it is 1024 which seems to be generally suitable. Make it configurable
via module option just in case someone finds a regression.

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:50:01 +0800
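
The bounded queue from 34db0cd60 can be sketched as a counter with a limit. The default of 1024 comes from the commit message; everything else (struct write_queue, queue_write, flush_pending, and refusing a write instead of sleeping) is a simplification assumed for illustration.

```c
#include <assert.h>

/* Default limit taken from the commit message. */
#define MAX_QUEUED_REQUESTS 1024

/* Hypothetical queue model: only the pending count matters here. */
struct write_queue {
	int pending_count;
};

/* Report congestion once the limit is reached. */
int write_congested(const struct write_queue *q)
{
	return q->pending_count >= MAX_QUEUED_REQUESTS;
}

/*
 * Queue one write.  Returns 1 on success, 0 if the caller has to wait.
 * (A real driver would sleep until the queue drains instead.)
 */
int queue_write(struct write_queue *q)
{
	assert(q->pending_count >= 0);
	if (write_congested(q))
		return 0;
	q->pending_count++;
	return 1;
}

/* One bitmap update covers all queued writes; then the queue is empty. */
int flush_pending(struct write_queue *q)
{
	int flushed = q->pending_count;

	q->pending_count = 0;
	return flushed;
}
```

With a finite limit the queue eventually reports congestion, which is the signal writeback needs in order to estimate throughput.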
84fc4b56d md: rename "mdk_personality" to "md_personality"

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:49:58 +0800
e80963604 md/raid1: typedef removal: conf_t -> struct r1conf

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:49:05 +0800
0f6d02d58 md: remove typedefs: mirror_info_t -> struct mirror_info

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:48:46 +0800
9f2c9d12b md: remove typedefs: r10bio_t -> struct r10bio and r1bio_t -> struct r1bio

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:48:43 +0800
fd01b88c7 md: remove typedefs: mddev_t -> struct mddev

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:47:53 +0800
3cb030020 md: removing typedefs: mdk_rdev_t -> struct md_rdev

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown

NeilBrown
2011-10-11 13:45:26 +0800

07 Oct, 2011

3 commits

36a4e1fe0 md: remove PRINTK and dprintk debugging and use pr_debug

Being able to dynamically enable these makes them much more useful.

Signed-off-by: NeilBrown

NeilBrown
2011-10-07 11:23:17 +0800
0fc280f60 md/raid1: avoid bio search in end_sync_read()

We know which device we just read from so we don't need to
search the bios to find out. Just use ->read_disk.

Signed-off-by: NeilBrown

NeilBrown
2011-10-07 11:22:55 +0800
ba3ae3bee md/raid1: factor out common bio handling code

When a normal-write or sync-read/write bio completes, we should
find out the disk number the bio belongs to. Factor that common
code out to a separate function.

Signed-off-by: Namhyung Kim
Signed-off-by: NeilBrown

Namhyung Kim
2011-10-07 11:22:53 +0800

21 Sep, 2011

1 commit

01f96c0a9 md: Avoid waking up a thread after it has been freed.

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
   without subsequently clearing ->thread. A subsequent call
   to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
   disappearing either by:
     - holding the ->mutex
     - having an active request, so something else must be keeping
       the array active.
   However mddev_unlock calls md_wakeup_thread after dropping the
   mutex and without any certainty of an active request, so the
   ->thread could theoretically disappear.
   So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov"
Signed-off-by: NeilBrown
cc: stable@kernel.org

NeilBrown
2011-09-21 13:30:20 +0800
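
The interface change in 01f96c0a9 (passing a pointer to the thread pointer) can be sketched in a single-threaded model. The names struct worker, unregister_worker and wakeup_worker are hypothetical, and the spinlock the commit message calls for is only marked by a comment; the sketch shows just the pointer-clearing part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-array helper thread. */
struct worker {
	int stopped;
	int wakeups;
};

/*
 * Take a pointer to the caller's thread pointer so that it can be cleared
 * here, in one place.  In the kernel the detach step happens under a
 * spinlock; this single-threaded model has no lock.
 */
void unregister_worker(struct worker **wp)
{
	struct worker *w;

	assert(wp != NULL);
	w = *wp;
	*wp = NULL;		/* no stale pointer is left behind */
	if (w)
		w->stopped = 1;
}

/* Waking a missing thread is a harmless no-op.  Returns 1 if woken. */
int wakeup_worker(struct worker *w)
{
	if (!w)
		return 0;
	w->wakeups++;
	return 1;
}
```

With the old by-value interface, each error path had to remember to clear the pointer itself, which is the first problem the commit message lists.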

12 Sep, 2011

1 commit

5a7bbad27 block: remove support for bio remapping from ->make_request

There is very little benefit in allowing a ->make_request
instance to update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request wrong
and returned non-zero values for errors.

Signed-off-by: Christoph Hellwig
Acked-by: NeilBrown
Signed-off-by: Jens Axboe

Christoph Hellwig
2011-09-12 18:12:01 +0800

10 Sep, 2011

1 commit

079fa166a md/raid1,10: Remove use-after-free bug in make_request.

A single request to RAID1 or RAID10 might result in multiple
requests if there are known bad blocks that need to be avoided.

To detect if we need to submit another write request we test:
    if (sectors_handled < (bio->bi_size >> 9)) {

However this is after we call **_write_done() so the 'bio' no longer
belongs to us - the writes could have completed and the bio freed.

So move the **_write_done call until after the test against
bio->bi_size.

This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862

Reported-by: Bruno Wolff III
Tested-by: Bruno Wolff III
Signed-off-by: NeilBrown

NeilBrown
2011-09-10 15:21:23 +0800
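
The ordering fix in 079fa166a can be sketched with a toy ownership flag. struct bio_model, write_done and submit_piece are hypothetical names, and the 'owned' field merely stands in for "the bio may already have been freed"; the sketch shows the size test being evaluated before ownership is given up.

```c
#include <assert.h>

/* Hypothetical model of a bio: size in bytes plus an ownership flag. */
struct bio_model {
	unsigned int size_bytes;
	int owned;	/* 1 while we may still touch the bio */
};

/* The test from the commit message; only valid while we own the bio. */
int needs_another_request(const struct bio_model *bio,
			  unsigned int sectors_handled)
{
	assert(bio->owned);
	return sectors_handled < (bio->size_bytes >> 9);
}

/* After this call the bio may complete and be freed at any time. */
void write_done(struct bio_model *bio)
{
	bio->owned = 0;
}

/* Fixed ordering: evaluate the test first, then give up ownership. */
int submit_piece(struct bio_model *bio, unsigned int sectors_handled)
{
	int more = needs_another_request(bio, sectors_handled);

	write_done(bio);
	return more;
}
```

Swapping the two statements in submit_piece() would trip the assertion, which is the model's equivalent of the use-after-free.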

28 Jul, 2011

11 commits

62096bce2 md/raid1: factor several functions out of raid1d()

raid1d is too big with several deep branches.
So separate them out into their own functions.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:38:13 +0800
3a9f28a51 md/raid1: improve handling of read failure during recovery.

If we cannot read a block from anywhere during recovery, there is
now a better approach than just giving up.
We can record a bad block on each device and keep going - being
careful not to clear the bad block when a write succeeds as it might -
it will be a write of incorrect data.

We have now reached the state where - for raid1 - we only call
md_error if md_set_badblocks has failed.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:33:42 +0800
d8f05d299 md/raid1: record badblocks found during resync etc.

If we find a bad block while writing as part of resync/recovery we
need to report that back to raid1d which must record the bad block,
or fail the device.

Similarly when fixing a read error, a further error should just
record a bad block if possible rather than failing the device.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:33:00 +0800
cd5ff9a16 md/raid1: Handle write errors by updating badblock log.

When we get a write error (in the data area, not in metadata),
update the badblock log rather than failing the whole device.

As the write may well be many blocks, we try writing each
block individually and only log the ones which fail.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:32:41 +0800
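
The block-by-block retry from cd5ff9a16 can be sketched with a write callback. This is an illustrative model only: write_fn, retry_write_by_block and the write_unless_broken test device are hypothetical, blocks are taken as fixed-size chunks from the start of the request, and a full log is reported as -1 (the case where the device would have to be failed instead).

```c
#include <assert.h>

/* Hypothetical write callback: returns 1 on success, 0 on error. */
typedef int (*write_fn)(long sector, int sectors, void *ctx);

/*
 * Retry a failed write one block at a time and log only the blocks that
 * still fail.  Returns the number of bad blocks logged, or -1 if the log
 * is full.
 */
int retry_write_by_block(long sector, int sectors, int block_sectors,
			 write_fn wr, void *ctx, long *bad_log, int log_cap)
{
	int nbad = 0;

	assert(sectors > 0 && block_sectors > 0);
	while (sectors > 0) {
		int chunk = sectors < block_sectors ? sectors : block_sectors;

		if (!wr(sector, chunk, ctx)) {
			if (nbad == log_cap)
				return -1;
			bad_log[nbad++] = sector;
		}
		sector += chunk;
		sectors -= chunk;
	}
	return nbad;
}

/* Example device: every write succeeds unless it touches one sector. */
int write_unless_broken(long sector, int sectors, void *ctx)
{
	long broken = *(const long *)ctx;

	return !(broken >= sector && broken < sector + sectors);
}
```

For a 32-sector write in 8-sector blocks with sector 19 broken, only the block starting at sector 16 ends up in the log.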
2ca68f5ed md/raid1: store behind-write pages in bi_vecs.

When performing write-behind we allocate pages to store the data
during write.
Previously we just kept a list of pages. Now we keep a list of
bi_vec which includes offset and size.
This means that the r1bio has complete information to create a new
bio which will be needed for retrying after write errors.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:32:10 +0800
4367af556 md/raid1: clear bad-block record when write succeeds.

If we succeed in writing to a block that was recorded as
being bad, we clear the bad-block record.

This requires some delayed handling as the bad-block-list update has
to happen in process-context.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:31:49 +0800
1f68f0c4b md/raid1: avoid writing to known-bad blocks on known-bad drives.

If we have seen any write error on a drive, then don't write to
any known-bad blocks on that drive.
If necessary, we divide the write request up into pieces just
like we do for reads, so each piece is either all written or
all not written to any given drive.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:31:48 +0800
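
The request splitting from 1f68f0c4b can be sketched as follows, assuming at most one bad range per drive. The names uniform_prefix and write_piece_len are hypothetical. Each piece is cut so that, for every drive, it lies entirely inside or entirely outside that drive's bad range; drives for which the piece is bad are reported in a skip mask.

```c
#include <assert.h>

/* Simplified model: one bad range per drive (len == 0 means none). */
struct bad_range {
	long start;
	long len;
};

/*
 * Length of the longest prefix of [sector, sector + sectors) that is
 * either all good or all bad on one drive; *is_bad says which.
 */
long uniform_prefix(const struct bad_range *b, long sector, long sectors,
		    int *is_bad)
{
	long end = sector + sectors;
	long bad_end = b->start + b->len;

	if (b->len == 0 || bad_end <= sector || b->start >= end) {
		*is_bad = 0;
		return sectors;
	}
	if (b->start <= sector) {
		*is_bad = 1;
		return (bad_end < end ? bad_end : end) - sector;
	}
	*is_bad = 0;
	return b->start - sector;
}

/*
 * Size of the next write piece across n drives (n <= 32).  Bit i of
 * *skip_mask is set if drive i must not be written for this piece.
 */
long write_piece_len(const struct bad_range *drives, int n,
		     long sector, long sectors, unsigned int *skip_mask)
{
	long len = sectors;
	int i, bad;

	assert(sectors > 0 && n > 0 && n <= 32);
	for (i = 0; i < n; i++) {
		long l = uniform_prefix(&drives[i], sector, sectors, &bad);

		if (l < len)
			len = l;
	}
	*skip_mask = 0;
	for (i = 0; i < n; i++) {
		uniform_prefix(&drives[i], sector, len, &bad);
		if (bad)
			*skip_mask |= 1u << i;
	}
	return len;
}
```

A caller would loop, advancing by the returned length each time, until the whole request has been issued.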
de393cdea md: make it easier to wait for bad blocks to be acknowledged.

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata. This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed. Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks. So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:31:48 +0800
06f603851 md/raid1: avoid reading known bad blocks during resync

When performing resync/etc, keep the size of the request
small enough that it doesn't overlap any known bad blocks.
Devices with badblocks at the start of the request are completely
excluded.
If there is nowhere to read from due to bad blocks, record
a bad block on each target device.

Now that we never read from known-bad-blocks we can allow devices with
known-bad-blocks into a RAID1.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:31:48 +0800
d2eb35acf md/raid1: avoid reading from known bad blocks.

Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
  1/ read_balance needs to check for bad blocks, and return not only
     the chosen device, but also how many good blocks are available
     there.
  2/ fix_read_error needs to avoid trying to read from bad blocks.
  3/ read submission must be ready to issue multiple reads to
     different devices as different bad blocks on different devices
     could mean that a single large read cannot be served by any one
     device, but can still be served by the array.
     This requires keeping count of the number of outstanding requests
     per bio. This count is stored in 'bi_phys_segments'.
  4/ retrying a read needs to also be ready to submit a smaller read
     and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.

Signed-off-by: NeilBrown

NeilBrown
2011-07-28 09:31:48 +0800
34b343cff md: don't allow arrays to contain devices with bad blocks.

As no personality understands bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.

Signed-off-by: NeilBrown
Reviewed-by: Namhyung Kim

NeilBrown
2011-07-28 09:31:47 +0800

27 Jul, 2011

1 commit

654e8b5ab MD: raid1 s/sysfs_notify_dirent/sysfs_notify_dirent_safe

If device-mapper creates a RAID1 array that includes devices to
be rebuilt, it will deref a NULL pointer when finished because
sysfs is not used by device-mapper instantiated RAID devices.

Signed-off-by: Jonathan Brassow
Signed-off-by: NeilBrown

Jonathan Brassow
2011-07-27 09:00:36 +0800