30 Aug, 2010

1 commit

  • MD_CHANGE_CLEAN is used for two different purposes and this leads to
    confusion.
    One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
    not used for anything else, so have MD_CHANGE_PENDING take over that
    purpose fully.

    The two purposes are:
    1/ tell md_update_sb that an update is needed and that it is just a
    clean/dirty transition.
    2/ tell user-space that a transition from clean to dirty is pending
    (something wants to write), and tell the kernel (by clearing the
    flag) that the transition is OK.

    The first purpose remains with MD_CHANGE_CLEAN, the second is moved
    fully to MD_CHANGE_PENDING.

    This means that various places which conditionally set or cleared
    MD_CHANGE_CLEAN no longer need to be conditional.

    Signed-off-by: NeilBrown

    NeilBrown
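
    A minimal sketch of the two-flag split this entry describes, written as
    an illustrative userspace C program (the bit numbers and helper names
    are assumptions, not the md.c code):

        #include <stdio.h>

        /* Illustrative bit positions; the kernel defines its own in md.h. */
        enum sb_flag_bits {
            MD_CHANGE_CLEAN   = 1, /* clean/dirty superblock update needed */
            MD_CHANGE_PENDING = 2, /* clean->dirty transition awaits an OK */
        };

        static unsigned long sb_flags;

        #define set_bit(nr, p)   (*(p) |=  (1UL << (nr)))
        #define clear_bit(nr, p) (*(p) &= ~(1UL << (nr)))
        #define test_bit(nr, p)  ((*(p) >> (nr)) & 1UL)

        /* Something wants to write: announce the pending transition. */
        static void mark_array_dirty(void)
        {
            set_bit(MD_CHANGE_CLEAN, &sb_flags);   /* tell md_update_sb */
            set_bit(MD_CHANGE_PENDING, &sb_flags); /* tell user-space   */
        }

        /* Whoever manages the metadata clears the flag to approve it. */
        static void approve_transition(void)
        {
            clear_bit(MD_CHANGE_PENDING, &sb_flags);
        }

        int main(void)
        {
            mark_array_dirty();
            printf("write may proceed: %s\n",
                   test_bit(MD_CHANGE_PENDING, &sb_flags) ? "no" : "yes");
            approve_transition();
            printf("write may proceed: %s\n",
                   test_bit(MD_CHANGE_PENDING, &sb_flags) ? "no" : "yes");
            return 0;
        }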
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://neil.brown.name/md: (24 commits)
    md: clean up do_md_stop
    md: fix another deadlock with removing sysfs attributes.
    md: move revalidate_disk() back outside open_mutex
    md/raid10: fix deadlock with unaligned read during resync
    md/bitmap: separate out loading a bitmap from initialising the structures.
    md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
    md/bitmap: optimise scanning of empty bitmaps.
    md/bitmap: clean up plugging calls.
    md/bitmap: reduce dependence on sysfs.
    md/bitmap: white space clean up and similar.
    md/raid5: export raid5 unplugging interface.
    md/plug: optionally use plugger to unplug an array during resync/recovery.
    md/raid5: add simple plugging infrastructure.
    md/raid5: export is_congested test
    raid5: Don't set read-ahead when there is no queue
    md: add support for raising dm events.
    md: export various start/stop interfaces
    md: split out md_rdev_init
    md: be more careful setting MD_CHANGE_CLEAN
    md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk
    ...

    Linus Torvalds
     

08 Aug, 2010

2 commits

  • Moving the deletion of sysfs attributes from reconfig_mutex to
    open_mutex didn't really help, as a process can try to take
    open_mutex while holding reconfig_mutex, so the same deadlock can
    still happen; it just requires one more process to be involved in
    the chain.

    It looks like I cannot easily use locking to wait for the sysfs
    deletion to complete, so don't.

    The only things that we cannot do while the deletions are still
    pending are other operations which can change the sysfs namespace:
    run, takeover, stop. Each of these can fail with -EBUSY.
    So set a flag while doing a sysfs deletion, and fail run, takeover,
    stop if that flag is set.

    This is suitable for 2.6.35.x

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Remove the current bio flags and reuse the request flags for the bio, too.
    This allows us to more easily trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing from the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

26 Jul, 2010

8 commits

  • This allows md/raid5 to fully work as a dm target.

    Normally md uses a 'filemap' which contains a list of pages of bits
    each of which may be written separately.
    dm-log uses an all-or-nothing approach to writing the log, so
    when using a dm-log, ->filemap is NULL and the flags normally stored
    in filemap_attr are stored in ->logattrs instead.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • 1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
    arrays with no queue attached.

    2/ Don't bother plugging the queue when we set a bit in the bitmap.
    The reason for this was to encourage as many bits as possible to
    get set before we unplug and write stuff out.
    However every personality already plugs the queue after
    bitmap_startwrite, either directly (raid1/raid10) or by setting
    STRIPE_BIT_DELAY which causes the queue to be plugged later
    (raid5).

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Fixed some whitespace problems.
    Fixed some checkpatch.pl complaints.
    Replaced kmalloc ... memset(0) with kzalloc.
    Fixed an unlikely memory leak on an error path.
    Reformatted a number of 'if/else' sets, sometimes
    replacing a goto with an else clause.
    Removed some old comments and commented-out code.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If an array doesn't have a 'queue' then md_do_sync cannot
    unplug it.
    In that case it will have a 'plugger', so make that available
    to the mddev, and use it to unplug the array if needed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • md/raid5 uses the plugging infrastructure provided by the block layer
    and 'struct request_queue'. However when we plug raid5 under dm there
    is no request queue so we cannot use that.

    So create a similar infrastructure that is much lighter weight and use
    it for raid5.

    Signed-off-by: NeilBrown

    NeilBrown
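
    A rough userspace sketch of the lighter-weight plugging idea from this
    entry: a 'plugged' flag plus a single unplug callback. The kernel
    version also uses a timer and a work item; the names here are
    illustrative, not the md interface.

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-in for request_queue plugging: batch work while plugged,
         * dispatch it through one callback when someone unplugs. */
        struct toy_plugger {
            bool plugged;
            void (*unplug_fn)(struct toy_plugger *);
        };

        static void plug(struct toy_plugger *p)
        {
            p->plugged = true;          /* hold back work for batching */
        }

        static void unplug(struct toy_plugger *p)
        {
            if (p->plugged) {
                p->plugged = false;
                p->unplug_fn(p);        /* dispatch whatever was batched */
            }
        }

        static void raid5_style_unplug(struct toy_plugger *p)
        {
            (void)p;
            printf("flush delayed stripes to the member devices\n");
        }

        int main(void)
        {
            struct toy_plugger pl = {
                .plugged = false,
                .unplug_fn = raid5_style_unplug,
            };

            plug(&pl);      /* requests arrive and are batched up ...    */
            unplug(&pl);    /* ... until resync/recovery (or a timer)
                             * unplugs the array                         */
            return 0;
        }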
     
  • dm uses scheduled work to raise events to user-space.
    So allow an md device to have work_structs and schedule them on an error.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • export entry points for starting and stopping md arrays.
    This will be used by a module to make md/raid5 work under
    dm.
    Also stop calling md_stop_writes from md_stop, as that won't
    work well with dm - it will want to call the two separately.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This functionality will be needed separately in a subsequent patch, so
    split it into its own exported function.

    Signed-off-by: NeilBrown

    NeilBrown
     

24 Jun, 2010

1 commit

  • Most array level changes leave the list of devices largely unchanged,
    possibly causing one at the end to become redundant.
    However conversions between RAID0 and RAID10 need to renumber
    all devices (except 0).

    This renumbering is currently being done in the ->run method when the
    new personality takes over. However this is too late as the common
    code in md.c might already have invalidated some of the devices if
    they had a ->raid_disk number that appeared too high.

    Moving it into the ->takeover method is too early as the array is
    still active at that time and wrong ->raid_disk numbers could cause
    confusion.

    So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
    the new raid_disk number.
    Now the common code knows exactly which devices need to be renumbered,
    and which can be invalidated, and can do it all at a convenient time
    when the array is suspended.
    It can also update some symlinks in sysfs which previously were not
    being updated correctly.

    Reported-by: Maciej Trela
    Signed-off-by: NeilBrown

    NeilBrown
     

18 May, 2010

5 commits

  • When updating the event count for a simple clean/dirty transition,
    we try to avoid updating the spares so they can safely spin down.
    As the event_counts across an array must all be within 1 of each
    other, this means decrementing the event_count on a dirty->clean
    transition.
    This is not always safe, and we have to avoid the unsafe times.
    We currently do this with a misguided idea about it being safe or
    not depending on whether the event_count is odd or even. This
    approach only works reliably in a few common instances, but easily
    falls down.

    So instead, simply keep internal state concerning whether it is safe
    or not, and always assume it is not safe when an array is first
    assembled.

    Signed-off-by: NeilBrown

    NeilBrown
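
    A rough illustration of the bookkeeping this entry describes, with an
    illustrative flag name and deliberately simplified conditions: only
    decrement on a clean transition if the last change was an increment we
    made ourselves, and start out assuming it is not safe.

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_array {
            unsigned long long events;
            bool can_decrease_events;       /* false right after assembly */
        };

        static void mark_dirty(struct toy_array *a)
        {
            a->events++;                    /* spares deliberately not updated */
            a->can_decrease_events = true;  /* this increment can be undone */
        }

        static void mark_clean(struct toy_array *a)
        {
            if (a->can_decrease_events) {
                a->events--;                /* undo our own increment so the
                                             * un-updated spares stay in sync */
                a->can_decrease_events = false;
            } else {
                a->events++;                /* not safe: move forward instead */
            }
        }

        int main(void)
        {
            struct toy_array a = { .events = 100, .can_decrease_events = false };

            mark_clean(&a);     /* freshly assembled: not safe, 100 -> 101 */
            mark_dirty(&a);     /* 101 -> 102, spares left alone           */
            mark_clean(&a);     /* safe this time, back to 101             */
            printf("events = %llu\n", a.events);
            return 0;
        }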
     
  • We used to pass the personality make_request function directly
    to the block layer, so the first argument had to be a queue.
    But now we have the intermediary md_make_request, so it makes
    a lot more sense to pass a struct mddev_s.
    It also makes it possible to have an mddev without its own queue.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We set ->changed to 1 and call check_disk_change at the end
    of md_open so that bd_invalidated would be set and thus
    partition rescan would happen appropriately.

    Now that we call revalidate_disk directly, which sets bd_invalidated,
    that indirection is no longer needed and can be removed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • This was needed when sysfs files could only be 'notified'
    from process context. Now that we have sysfs_notify_dirent,
    we can call it directly from an interrupt.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • These fields have never been used.
    commit 4b6d287f627b5fb6a49f78f9e81649ff98c62bb7
    added them, but also added identical fields to bitmap_super_s,
    and only used the latter.

    So remove these unused fields.

    Signed-off-by: NeilBrown

    NeilBrown
     

17 May, 2010

1 commit

  • Some levels expect the 'redundancy group' to be present,
    others don't.
    So when we change the level of an array we might need to
    add or remove this group.

    This requires fixing up the current practice of overloading ->private
    to indicate (when ->pers == NULL) that something needs to be removed.
    So create a new ->to_remove to fill that role.

    When changing levels, we may need to add or remove attributes. When
    changing RAID5 -> RAID6, we both add and remove the same thing. It is
    important to catch this and optimise it out as the removal is delayed
    until a lock is released, so trying to add immediately would cause
    problems.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

14 Dec, 2009

9 commits

  • We've noticed severe lasting performance degradation of our raid
    arrays when we have drives that yield large amounts of media errors.
    The raid10 module will queue each failed read for retry, and also
    will attempt to call fix_read_error() to perform the read recovery.
    Read recovery is performed while the array is frozen, so repeated
    recovery attempts can degrade the performance of the array for
    extended periods of time.

    With this patch I propose adding a per md device max number of
    corrected read attempts. Each rdev will maintain a count of
    read correction attempts in the rdev->read_errors field (not
    used currently for raid10). When we enter fix_read_error()
    we'll check to see when the last read error occurred, and
    divide the read error count by 2 for every hour since the
    last read error. If at that point our read error count
    exceeds the read error threshold, we'll fail the raid device.

    In addition, this patch adds sysfs nodes (get/set) for the per-md
    max_read_errors attribute and the rdev->read_errors attribute, and
    adds some printk's to indicate when fix_read_error fails to repair
    an rdev.

    For testing I used debugfs->fail_make_request to inject
    IO errors to the rdev while doing IO to the raid array.

    Signed-off-by: Robert Becker
    Signed-off-by: NeilBrown

    Robert Becker
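
    The hourly halving described above is easy to model; a small userspace
    sketch with hypothetical names (not the raid10 code):

        #include <stdio.h>
        #include <time.h>

        /* Halve the stored count once per full hour since the last error,
         * then record the new error and compare against the maximum. */
        static int record_read_error(unsigned int *read_errors,
                                     time_t *last_error,
                                     unsigned int max_read_errors)
        {
            time_t now = time(NULL);
            unsigned long hours = (unsigned long)(now - *last_error) / 3600;

            if (hours >= sizeof(*read_errors) * 8)
                *read_errors = 0;
            else
                *read_errors >>= hours;

            *last_error = now;
            (*read_errors)++;

            return *read_errors > max_read_errors;  /* non-zero: fail the rdev */
        }

        int main(void)
        {
            unsigned int errors = 40;
            time_t last = time(NULL) - 3 * 3600;    /* last error 3 hours ago */

            if (record_read_error(&errors, &last, 20))
                printf("threshold exceeded, the device would be failed\n");
            else
                printf("count decayed to %u, the device is kept\n", errors);
            return 0;
        }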
     
  • In this case, the metadata needs to not be in the same
    sector as the bitmap.
    md will not read/write any bitmap metadata. Config must be
    done via sysfs and when a recovery makes the array non-degraded
    again, writing 'true' to 'bitmap/can_clear' will allow bits in
    the bitmap to be cleared again.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • A new attribute directory 'bitmap' in 'md' is created which
    contains files for configuring the bitmap.
    'location' identifies where the bitmap is, either 'none',
    or 'file' or 'sector offset from metadata'.
    Writing 'location' can create or remove a bitmap.
    Adding a 'file' bitmap this way is not yet supported.
    'chunksize' and 'time_base' must be set before 'location'
    can be set.

    'chunksize' can be set before creating a bitmap, but is
    currently always overridden by the bitmap superblock.

    'time_base' and 'backlog' can be updated at any time.

    Signed-off-by: NeilBrown
    Reviewed-by: Andre Noll

    NeilBrown
     
  • safe_delay_store can parse fixed point numbers (for fractions
    of a second). We will want to do that for another sysfs
    file soon, so factor out the code.

    Signed-off-by: NeilBrown

    NeilBrown
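
    For illustration, a standalone parser in the same spirit (a sketch of
    the idea, not the helper that was actually factored out of
    safe_delay_store):

        #include <ctype.h>
        #include <stdio.h>

        /* Parse a decimal "seconds.fraction" string into milliseconds,
         * keeping at most three fractional digits. */
        static int parse_fixed_point_msec(const char *buf, unsigned long *msec)
        {
            unsigned long whole = 0, frac = 0, scale = 1000;
            int seen_dot = 0;

            for (; *buf && *buf != '\n'; buf++) {
                if (*buf == '.' && !seen_dot) {
                    seen_dot = 1;
                    continue;
                }
                if (!isdigit((unsigned char)*buf))
                    return -1;
                if (!seen_dot) {
                    whole = whole * 10 + (unsigned long)(*buf - '0');
                } else if (scale > 1) {
                    scale /= 10;
                    frac += (unsigned long)(*buf - '0') * scale;
                }
            }
            *msec = whole * 1000 + frac;
            return 0;
        }

        int main(void)
        {
            unsigned long msec;

            if (parse_fixed_point_msec("5.73\n", &msec) == 0)
                printf("%lu ms\n", msec);       /* prints "5730 ms" */
            return 0;
        }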
     
  • For md arrays where metadata is managed externally, the kernel does not
    know about a superblock, so the superblock offset is 0.
    If we want to have a write-intent-bitmap near the end of the
    devices of such an array, we should support a sector_t sized offset.
    The offset may need to be negative, for when the bitmap is placed
    before the metadata, so use loff_t instead.

    Also add a sanity check that the bitmap does not overlap with the data.

    Signed-off-by: NeilBrown

    NeilBrown
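
    A sketch of the kind of sanity check meant here, as a plain
    interval-overlap test with made-up names and units (the kernel works
    on the rdev/bitmap fields, not these parameters):

        #include <stdio.h>

        /* With externally managed metadata the superblock offset is 0 and
         * the signed bitmap offset is relative to it; reject layouts where
         * the bitmap region lands inside the data region.  All values are
         * in sectors and purely illustrative. */
        static int bitmap_overlaps_data(long long bitmap_offset,
                                        unsigned long bitmap_sectors,
                                        long long data_offset,
                                        unsigned long long data_sectors)
        {
            long long bmap_start = bitmap_offset;
            long long bmap_end   = bitmap_offset + (long long)bitmap_sectors;
            long long data_start = data_offset;
            long long data_end   = data_offset + (long long)data_sectors;

            return bmap_start < data_end && data_start < bmap_end;
        }

        int main(void)
        {
            /* bitmap placed just before the offset-0 metadata: no overlap */
            printf("%d\n", bitmap_overlaps_data(-64, 64, 0, 1 << 20));
            /* bitmap placed on top of the data region: overlap */
            printf("%d\n", bitmap_overlaps_data(128, 64, 0, 1 << 20));
            return 0;
        }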
     
  • ... and into bitmap_info. These are all configuration parameters
    that need to be set before the bitmap is created.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • In preparation for making bitmap fields configurable via sysfs,
    start tidying up by making a single structure to contain the
    configuration fields.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Previously barriers were only supported on RAID1. This is because
    other levels require synchronisation across all devices and so need
    a different approach.
    Here is that approach.

    When a barrier arrives, we send a zero-length barrier to every active
    device. When that completes - and if the original request was not
    empty - we submit the barrier request itself (with the barrier flag
    cleared) and then submit a fresh load of zero length barriers.

    The barrier request itself is asynchronous, but any subsequent
    request will block until the barrier completes.

    The reason for clearing the barrier flag is that a barrier request is
    allowed to fail. If we pass a non-empty barrier through a striping
    raid level it is conceivable that part of it could succeed and part
    could fail. That would be way too hard to deal with.
    So if the first run of zero length barriers succeed, we assume all is
    sufficiently well that we send the request and ignore errors in the
    second run of barriers.

    RAID5 needs extra care as write requests may not have been submitted
    to the underlying devices yet. So we flush the stripe cache before
    proceeding with the barrier.

    Note that the second set of zero-length barriers is submitted
    immediately after the original request is submitted. Thus when
    a personality finds mddev->barrier to be set during make_request,
    it should not return from make_request until the corresponding
    per-device request(s) have been queued.

    That will be done in later patches.

    Signed-off-by: NeilBrown
    Reviewed-by: Andre Noll

    NeilBrown
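
    A toy model of the ordering described above (print statements stand in
    for I/O; this is not the md code, and in the real striped case only the
    affected member devices see the data portion):

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_dev { const char *name; };

        static void submit(struct toy_dev *d, const char *what)
        {
            printf("%s: %s\n", d->name, what);
        }

        /* Sequence for one incoming barrier request. */
        static void handle_barrier(struct toy_dev *devs, int ndevs,
                                   bool request_empty)
        {
            int i;

            /* 1. zero-length barrier to every active device */
            for (i = 0; i < ndevs; i++)
                submit(&devs[i], "zero-length barrier (pre)");

            /* 2. if the original request carried data, submit it with the
             *    barrier flag cleared; a partly failed striped barrier
             *    write would otherwise be impossible to handle */
            if (!request_empty)
                for (i = 0; i < ndevs; i++)
                    submit(&devs[i], "data write, barrier flag cleared");

            /* 3. a second run of zero-length barriers; errors in this run
             *    are ignored */
            for (i = 0; i < ndevs; i++)
                submit(&devs[i], "zero-length barrier (post)");
        }

        int main(void)
        {
            struct toy_dev devs[] = { { "sda" }, { "sdb" }, { "sdc" } };

            handle_barrier(devs, 3, false);
            return 0;
        }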
     
  • A write intent bitmap can be removed from an array while the
    array is active.
    When this happens, all IO is suspended and flushed before the
    bitmap is removed.
    However it is possible that bitmap_daemon_work is still running to
    clear old bits from the bitmap. If it is, it can dereference the
    bitmap after it has been freed.

    So introduce a new mutex to protect bitmap_daemon_work and get it
    before destroying a bitmap.

    This is suitable for any current -stable kernel.

    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    NeilBrown
     


10 Aug, 2009

1 commit

  • A recent commit:
    commit 449aad3e25358812c43afc60918c5ad3819488e7

    introduced the possibility of an A-B/B-A deadlock between
    bd_mutex and reconfig_mutex.

    __blkdev_get holds bd_mutex while calling md_open, which takes
    reconfig_mutex.
    do_md_run is always called with reconfig_mutex held, and it now
    takes bd_mutex in the call to revalidate_disk.

    This potential deadlock was not caught by lockdep due to the
    use of mutex_lock_interruptible_nested, which was introduced
    by
    commit d63a5a74dee87883fda6b7d170244acaac5b05e8
    to avoid a warning about an impossible deadlock.

    It is quite possible to split reconfig_mutex into two locks.
    One protects the array data structures while it is being
    reconfigured, the other ensures that an array is never even partially
    open while it is being deactivated.
    In particular, the second lock prevents an open from completing
    between the time when do_md_stop checks if there are any active opens,
    and the time when the array is either set read-only, or when ->pers is
    set to NULL. So we can be certain that no IO is in flight as the
    array is being destroyed.

    So create a new lock, open_mutex, just to ensure exclusion between
    'open' and 'stop'.

    This avoids the deadlock and also avoids the lockdep warning mentioned
    in commit d63a5a74d

    Reported-by: "Mike Snitzer"
    Reported-by: "H. Peter Anvin"
    Signed-off-by: NeilBrown

    NeilBrown
     

03 Aug, 2009

1 commit

  • This patch replaces md_integrity_check() by two new public functions:
    md_integrity_register() and md_integrity_add_rdev() which are both
    personality-independent.

    md_integrity_register() is called from the ->run and ->hot_remove
    methods of all personalities that support data integrity. The
    function iterates over the component devices of the array and
    determines if all active devices are integrity capable and if their
    profiles match. If this is the case, the common profile is registered
    for the mddev via blk_integrity_register().

    The second new function, md_integrity_add_rdev(), is called from the
    ->hot_add_disk methods, i.e. whenever a new device is being added
    to a raid array. If the new device does not support data integrity,
    or has a profile different from the one already registered, data
    integrity for the mddev is disabled.

    For raid0 and linear, only the call to md_integrity_register() from
    the ->run method is necessary.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     

18 Jun, 2009

6 commits

  • If the superblock of a component device indicates the presence of a
    bitmap but the corresponding raid personality does not support bitmaps
    (raid0, linear, multipath, faulty), then something is seriously wrong
    and we'd better refuse to run such an array.

    Currently, this check is performed while the superblocks are examined,
    i.e. before entering personality code. Therefore the generic md layer
    must know which raid levels support bitmaps and which do not.

    This patch avoids this layer violation without adding identical code
    to various personalities. This is accomplished by introducing a new
    public function to md.c, md_check_no_bitmap(), which replaces the
    hard-coded checks in the superblock loading functions.

    A call to md_check_no_bitmap() is added to the ->run method of each
    personality which does not support bitmaps and assembly is aborted
    if at least one component device contains a bitmap.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • It is easiest to round sizes to multiples of chunk size in
    the personality code for those personalities which care.
    Those personalities now do the rounding, so we can
    remove that function from common code.

    Also remove the upper bound on the size of a chunk, and the lower
    bound on the size of a device (1 chunk), neither of which really buy
    us anything.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • The difference between these two methods is artificial.
    Both check that a pending reshape is valid, and perform any
    aspect of it that can be done immediately.
    'reconfig' handles chunk size and layout.
    'check_reshape' handles raid_disks.

    So make them just one method.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Passing the new layout and chunksize as args is not necessary as
    the mddev has fields for new_chunk and new_layout.

    This is preparation for combining the check_reshape and reconfig
    methods.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • A straightforward conversion which gets rid of some
    multiplications/divisions/shifts. The patch also introduces a couple
    of new ones, most of which are due to conf->chunk_size still being
    represented in bytes. This will be cleaned up in subsequent patches.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
     
  • This patch renames the chunk_size field to chunk_sectors with the
    implied change of semantics. Since

    is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
    = is_power_of_2(chunk_sectors)

    these bits don't need an adjustment for the shift.

    Signed-off-by: Andre Noll
    Signed-off-by: NeilBrown

    Andre Noll
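
    The stated identity can be checked directly; a quick userspace C
    illustration (shifting by 9 multiplies by 512, itself a power of two,
    so the power-of-two property is preserved in both directions, barring
    overflow):

        #include <stdbool.h>
        #include <stdio.h>

        static bool is_power_of_2(unsigned long n)
        {
            return n != 0 && (n & (n - 1)) == 0;
        }

        int main(void)
        {
            unsigned long chunk_size    = 64 * 1024;       /* bytes   */
            unsigned long chunk_sectors = chunk_size >> 9; /* sectors */

            printf("%d %d\n", is_power_of_2(chunk_size),
                              is_power_of_2(chunk_sectors));
            return 0;
        }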
     

14 Apr, 2009

1 commit

  • - update inclusion guard and make sure it covers the whole file
    - remove superfluous #ifdef CONFIG_BLOCK
    - make sure all required headers are included so that new users aren't
    required to include other headers first

    Signed-off-by: Christoph Hellwig
    Signed-off-by: NeilBrown

    Christoph Hellwig