23 Dec, 2011

4 commits

  • During recovery we want to write to the replacement but not
    the original. So we have two new flags
    - R5_NeedReplace if this stripe has a replacement that needs to
    be written at some stage
    - R5_WantReplace if NeedReplace, and the data is available, and
    a 'sync' has been requested on this stripe.

    We also distinguish between 'sync and replace', which needs to read
    all other devices, and 'replace', which only needs to read the
    devices being replaced.

    Note that during resync we always write to any replacement device.
    It might not need to be written to, but as we don't read to compare,
    we have to write to be sure.

    Signed-off-by: NeilBrown

    NeilBrown
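
    A minimal userspace sketch of the two flags described above, assuming a
    much-simplified per-device structure; the flag names follow the commit,
    everything else is illustrative rather than the actual raid5 code.

        #include <stdbool.h>
        #include <stdio.h>

        /* hypothetical per-device flag bits modelled on the commit text */
        enum {
            R5_NeedReplace = 1 << 0, /* replacement exists, must be written eventually */
            R5_WantReplace = 1 << 1, /* write it now: data ready and a sync was requested */
        };

        struct dev_model {
            bool has_replacement;
            bool data_uptodate;
            unsigned flags;
        };

        void mark_replacement(struct dev_model *d, bool sync_requested)
        {
            if (!d->has_replacement)
                return;
            d->flags |= R5_NeedReplace;
            if (d->data_uptodate && sync_requested)
                d->flags |= R5_WantReplace;
        }

        int main(void)
        {
            struct dev_model d = { .has_replacement = true, .data_uptodate = true };

            mark_replacement(&d, true);
            printf("need=%d want=%d\n",
                   !!(d.flags & R5_NeedReplace), !!(d.flags & R5_WantReplace));
            return 0;
        }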
     
  • When writing, we need to submit two writes, one to the original, and
    one to the replacement - if there is a replacement.

    If the write to the replacement results in a write error, we just fail
    the device. We only try to record write errors to the original.

    When writing for recovery, we shouldn't write to the original. This
    will be addressed in a subsequent patch that generally addresses
    recovery.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
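
    As a rough illustration of the write path described above, a hedged
    userspace model: one write is submitted to the original device and,
    when a replacement is attached, a second write to the replacement. The
    structure and names are assumptions for the sketch, not md code.

        #include <stdbool.h>
        #include <stdio.h>

        /* hypothetical slot: the original device plus an optional replacement */
        struct slot {
            const char *orig;
            const char *repl;     /* NULL when no replacement is attached */
            bool want_write;
        };

        void submit_write(const char *dev)
        {
            printf("WRITE -> %s\n", dev);
        }

        /* one bio to the original and, when a replacement is present, a second
         * bio to the replacement -- the behaviour described for normal writes */
        void write_slot(const struct slot *s)
        {
            if (!s->want_write)
                return;
            submit_write(s->orig);
            if (s->repl)
                submit_write(s->repl);
        }

        int main(void)
        {
            struct slot s = { .orig = "original", .repl = "replacement", .want_write = true };

            write_slot(&s);
            return 0;
        }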
     
  • Remove some #defines that are no longer used, and replace some
    others with an enum.
    And remove an unused field.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Just enhance data structures to record a second device per slot to be
    used as a 'replacement' device, replacing the original.
    We also have a second bio in each slot in each stripe_head. This will
    only be used when writing to the array - we need to write to both the
    original and the replacement at the same time, so will need two bios.

    For now, only try using the replacement drive for aligned-reads.
    In this case, we prefer the replacement if it has been recovered far
    enough, otherwise use the original.

    This includes a small enhancement. Previously we would only do
    aligned reads if the target device was fully recovered. Now we also
    do them if it has recovered far enough.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
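
    A hedged sketch of the shape of the change: each slot gains a second
    device pointer, and an aligned read prefers the replacement only when
    its recovery has progressed past the end of the request. The types and
    names below are illustrative stand-ins, not the real raid5 structures.

        typedef unsigned long long sector_t;   /* stand-in for the kernel type */

        /* how far the (possibly still recovering) device has valid data */
        struct rdev_model {
            sector_t recovery_offset;
        };

        /* one slot now carries the original device and an optional replacement */
        struct disk_slot {
            struct rdev_model *rdev;          /* original */
            struct rdev_model *replacement;   /* may be NULL */
        };

        /* for an aligned read, prefer the replacement if it has been recovered
         * far enough to cover the request; otherwise use the original */
        struct rdev_model *read_target(struct disk_slot *slot,
                                       sector_t start, sector_t len)
        {
            struct rdev_model *r = slot->replacement;

            if (r && r->recovery_offset >= start + len)
                return r;
            return slot->rdev;
        }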
     

11 Oct, 2011

4 commits


28 Jul, 2011

3 commits


26 Jul, 2011

4 commits

  • Adding these three fields will allow more common code to be moved
    to handle_stripe().

    struct field rearrangement by Namhyung Kim.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • 'struct stripe_head_state' stores state about the 'current' stripe
    that is passed around while handling the stripe.
    For RAID6 there is an extension structure: r6_state, which is also
    passed around.
    There is no value in keeping these separate, so move the fields from
    the latter into the former.

    This means that all code now needs to treat s->failed_num as a small
    array, but this is a small cost.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
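
    A hedged sketch of the merged state structure described above; the
    field names are illustrative, but the point is that failed_num becomes
    a two-entry array (RAID-5 only ever uses entry 0).

        struct stripe_head_state_model {
            int syncing;
            int failed;          /* number of failed devices affecting this stripe */
            int failed_num[2];   /* their slot numbers; one for RAID-5, up to two for RAID-6 */
        };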
     
  • sh->lock is now mainly used to ensure that two threads aren't running
    in the locked part of handle_stripe[56] at the same time.

    That can more neatly be achieved with an 'active' flag which we set
    while running handle_stripe. If we find the flag is set, we simply
    requeue the stripe for later by setting STRIPE_HANDLE.

    For safety we take ->device_lock while examining the state of the
    stripe and creating a summary in 'stripe_head_state / r6_state'.
    This may not be needed, but since shared fields like ->toread and
    ->towrite are examined, it is safer for now.

    We leave the label after the old 'unlock' called "unlock" because it
    will disappear in a few patches, so renaming seems pointless.

    This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
    later, but that is not a problem.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
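
    A minimal model of that scheme, assuming C11 atomics in place of the
    kernel's bitops: the handler claims an 'active' flag atomically, and if
    it is already held it just marks the stripe to be handled again. The
    structure and names are illustrative, not the raid5 ones.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct stripe_model {
            atomic_flag active;     /* stands in for STRIPE_ACTIVE */
            bool handle_again;      /* stands in for setting STRIPE_HANDLE */
        };

        void handle_stripe(struct stripe_model *sh)
        {
            if (atomic_flag_test_and_set(&sh->active)) {
                sh->handle_again = true;   /* someone else is inside; requeue for later */
                return;
            }
            /* ... examine the stripe state and decide what to do ... */
            atomic_flag_clear(&sh->active);
        }

        int main(void)
        {
            struct stripe_model sh = { .active = ATOMIC_FLAG_INIT, .handle_again = false };

            handle_stripe(&sh);
            printf("requeued: %d\n", sh.handle_again);
            return 0;
        }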
     
  • This is the start of a series of patches to remove sh->lock.

    sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
    there is no race with testing it in handle_stripe[56].

    Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
    in handle_stripe[56] (after getting the same lock) and perform the
    same set/clear operations if it was set.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     

18 Apr, 2011

1 commit

  • md has some plugging infrastructure for RAID5 to use because the
    normal plugging infrastructure required a 'request_queue', and when
    called from dm, RAID5 doesn't have one of those available.

    This relied on the ->unplug_fn callback which doesn't exist any more.

    So remove all of that code, both in md and raid5. Subsequent patches
    will restore the plugging functionality.

    Signed-off-by: NeilBrown

    NeilBrown
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Sep, 2010

1 commit

  • This patch converts md to support REQ_FLUSH/FUA instead of the now
    deprecated REQ_HARDBARRIER. In the core part (md.c), the following
    changes are notable.

    * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
    processing of other requests and thus there is no reason to mark the
    queue congested while FLUSH/FUA is in progress.

    * REQ_FLUSH/FUA failures are final and their users don't need retry
    logic. Retry logic is removed.

    * Preflush needs to be issued to all member devices but FUA writes can
    be handled the same way as other writes - their processing can be
    deferred to request_queue of member devices. md_barrier_request()
    is renamed to md_flush_request() and simplified accordingly.

    For linear, raid0 and multipath, the core changes are enough. raid1,
    5 and 10 need the following conversions.

    * raid1: Handling of FLUSH/FUA bios can simply be deferred to
    request_queues of member devices. Barrier related logic removed.

    * raid5: Queue draining logic dropped. FUA bit is propagated through
    biodrain and stripe reconstruction such that all the updated parts
    of the stripe are written out with FUA writes if any of the dirtying
    writes was FUA. preread_active_stripes handling in make_request()
    is updated as suggested by Neil Brown.

    * raid10: FUA bit needs to be propagated to write clones.

    linear, raid0, 1, 5 and 10 tested.

    Signed-off-by: Tejun Heo
    Reviewed-by: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
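
    A hedged userspace model of the raid5-side FUA handling described
    above: the FUA bit from any bio drained into a stripe is remembered and
    applied to the writes that eventually flush the stripe. The names and
    the flag value are stand-ins, not the block layer's definitions.

        #include <stdio.h>

        #define REQ_FUA_MODEL 0x1u   /* stand-in for the block layer's FUA flag */

        struct stripe_model {
            unsigned pending_flags;  /* flags gathered from bios drained into the stripe */
        };

        /* remember whether any write drained into this stripe carried FUA */
        void biodrain(struct stripe_model *sh, unsigned bio_flags)
        {
            sh->pending_flags |= bio_flags & REQ_FUA_MODEL;
        }

        /* every write that flushes the stripe inherits the accumulated FUA bit */
        unsigned stripe_write_flags(const struct stripe_model *sh)
        {
            return sh->pending_flags;
        }

        int main(void)
        {
            struct stripe_model sh = { 0 };

            biodrain(&sh, 0);               /* an ordinary write dirties the stripe */
            biodrain(&sh, REQ_FUA_MODEL);   /* ...and so does one FUA write */
            printf("stripe writes carry FUA: %d\n",
                   !!(stripe_write_flags(&sh) & REQ_FUA_MODEL));
            return 0;
        }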
     

26 Jul, 2010

4 commits


21 Jul, 2010

1 commit


17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to places which didn't make it in one
    of the previous patches. All conversions are trivial.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    Signed-off-by: Tejun Heo
    Acked-by: Borislav Petkov
    Cc: Dan Williams
    Cc: Huang Ying
    Cc: Len Brown
    Cc: Neil Brown

    Tejun Heo
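
    A generic kernel-style illustration of what the annotation buys (not
    code from this patch): a dynamically allocated percpu pointer is marked
    __percpu and only touched through the percpu accessors, so sparse can
    flag direct dereferences; normal builds compile the annotation away.

        #include <linux/percpu.h>
        #include <linux/errno.h>

        struct percpu_example {
            void *scribble;
        };

        struct conf_example {
            /* __percpu: this pointer lives in the percpu address space */
            struct percpu_example __percpu *percpu;
        };

        int conf_example_alloc(struct conf_example *conf)
        {
            conf->percpu = alloc_percpu(struct percpu_example);
            if (!conf->percpu)
                return -ENOMEM;
            /* access must go through an accessor such as per_cpu_ptr() */
            per_cpu_ptr(conf->percpu, 0)->scribble = NULL;
            return 0;
        }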
     

16 Oct, 2009

2 commits

  • Signed-off-by: NeilBrown

    NeilBrown
     
  • The percpu conversion allowed a straightforward handoff of stripe
    processing to the async subsystem that initially showed some modest gains
    (+4%). However, this model is too simplistic and leads to stripes
    bouncing between raid5d and the async thread pool for every invocation
    of handle_stripe(). As reported by Holger this can fall into a
    pathological situation severely impacting throughput (6x performance
    loss).

    By downleveling the parallelism to raid_run_ops the pathological
    stripe_head bouncing is eliminated. This version still exhibits an
    average 11% throughput loss for:

    mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
    echo 1024 > /sys/block/md0/md/stripe_cache_size
    dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

    ...but the results are at least stable and can be used as a base for
    further multicore experimentation.

    Reported-by: Holger Kiehl
    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams
     

09 Sep, 2009

1 commit


30 Aug, 2009

4 commits

  • [ Based on an original patch by Yuri Tikhonov ]

    The raid_run_ops routine uses the asynchronous offload api and
    the stripe_operations member of a stripe_head to carry out xor+pq+copy
    operations asynchronously, outside the lock.

    The operations performed by RAID-6 are the same as in the RAID-5 case,
    except that STRIPE_OP_PREXOR is not supported. All the others are
    supported:
    STRIPE_OP_BIOFILL
    - copy data into request buffers to satisfy a read request
    STRIPE_OP_COMPUTE_BLK
    - generate missing blocks (1 or 2) in the cache from the other blocks
    STRIPE_OP_BIODRAIN
    - copy data out of request buffers to satisfy a write request
    STRIPE_OP_RECONSTRUCT
    - recalculate parity for new data that has entered the cache
    STRIPE_OP_CHECK
    - verify that the parity is correct

    The flow is the same as in the RAID-5 case, and reuses some routines, namely:
    1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
    2/ ops_complete_compute (updated to set up to 2 targets uptodate)
    3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

    [neilb@suse.de: fixes to get it to pass mdadm regression suite]
    Reviewed-by: Andre Noll
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the flat zero_sum_result with a collection of flags to contain
    the P (xor) zero-sum result and the soon-to-be-utilized Q (RAID-6
    Reed-Solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace
    instead
    of DMA_ since these flags will be used on non-dma-zero-sum enabled
    platforms.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
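
    The flag layout being described is roughly the following (a hedged
    sketch; the exact definitions in the async_tx headers may differ):

        /* one result bit for the P (xor) zero-sum check, one for the Q
         * (RAID-6 Reed-Solomon syndrome) zero-sum check */
        enum sum_check_bits_model {
            SUM_CHECK_P = 0,
            SUM_CHECK_Q = 1,
        };

        enum sum_check_flags_model {
            SUM_CHECK_P_RESULT = 1 << SUM_CHECK_P, /* set when the xor parity check fails */
            SUM_CHECK_Q_RESULT = 1 << SUM_CHECK_Q, /* set when the syndrome check fails */
        };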
     
  • Use percpu memory rather than stack for storing the buffer lists used in
    parity calculations. Include space for dma address conversions and pass
    that to async_tx via the async_submit_ctl.scribble pointer.

    [ Impact: move memory pressure from stack to heap ]

    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for asynchronous handling of raid6 operations move the
    spare page to a percpu allocation to allow multiple simultaneous
    synchronous raid6 recovery operations.

    Make this allocation cpu hotplug aware to maximize allocation
    efficiency.

    Signed-off-by: Dan Williams

    Dan Williams
     

18 Jun, 2009

1 commit


16 Jun, 2009

1 commit

  • Having a macro just to cast a void* isn't really helpful.
    I would much rather see that we are simply dereferencing ->private
    than have to know what the macro does.

    So open code the macro everywhere and remove the pointless cast.

    Signed-off-by: NeilBrown

    NeilBrown
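
    A userspace analogue of the cleanup (names are illustrative, not the
    md ones): a macro that only casts ->private adds nothing, because in C
    a void pointer converts implicitly to the target type.

        #include <stdio.h>

        struct conf { int level; };
        struct dev  { void *private; };

        /* before: something like  #define dev_to_conf(d) ((struct conf *)(d)->private) */

        int main(void)
        {
            struct conf c = { .level = 5 };
            struct dev d = { .private = &c };

            /* after: open-coded dereference of ->private, no macro, no cast */
            struct conf *conf = d.private;

            printf("level=%d\n", conf->level);
            return 0;
        }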
     

31 Mar, 2009

7 commits

  • We currently update the metadata:
    1/ every 3 megabytes
    2/ When the place we will write new-layout data to is recorded in
    the metadata as still containing old-layout data.

    Rule one exists to avoid having to re-do too much reshaping in the
    face of a crash/restart. So it should really be time based rather
    than size based. So change it to "every 10 seconds".

    Rule two turns out to be too harsh when restriping an array
    'in-place', as in that case the metadata must be updated for every
    stripe.
    For the in-place update, it can only possibly be safe from a crash if
    some user-space program takes a backup of, say, every few hundred
    stripes before allowing them to be reshaped. In that case, the
    constant metadata update is pointless.
    So only update the metadata if the new metadata will report that the
    end of the 'old-layout' data is beyond where we are currently
    writing 'new-layout' data.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Add prev_algo to raid5_conf_t along the same lines as prev_chunk
    and previous_raid_disks.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
    remember what the chunk size was before the reshape that is currently
    underway.

    This seems like duplication with "chunk_size" and "new_chunk" in
    mddev_t, and to some extent it is, but there are differences.
    The values in mddev_t are always defined and often the same.
    The prev* values are only defined if a reshape is underway.

    Also (and more significantly) the raid5_conf_t values will be changed
    at the same time (inside an appropriate lock) that the reshape is
    started by setting reshape_position. In contrast, the new_chunk value
    is set when the sysfs file is written which could be well before the
    reshape starts.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • During a raid5 reshape, we have some stripes in the cache that are
    'before' the reshape (and are still to be processed) and some that are
    'after'. They are currently differentiated by having different
    ->disks values, as the only reshape currently supported involves changing
    the number of disks.

    However, we will soon support reshapes that do not change the number
    of disks (a change of parity layout or chunk size). So make the
    difference more
    explicit with a 'generation' number.

    Signed-off-by: NeilBrown

    NeilBrown
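
    A hedged sketch of the idea: the array keeps a generation counter that
    is bumped when a reshape starts, each stripe records the generation it
    was initialised under, and a mismatch marks the stripe as belonging to
    the previous ('before') geometry. The names are illustrative only.

        typedef unsigned long long sector_t;   /* stand-in for the kernel type */

        struct conf_model {
            int generation;       /* incremented when a reshape begins */
        };

        struct stripe_model {
            sector_t sector;
            int generation;       /* copied from conf->generation at init time */
        };

        /* a stripe whose generation no longer matches the array's was created
         * before the reshape and still uses the old geometry */
        int stripe_uses_old_layout(const struct conf_model *conf,
                                   const struct stripe_model *sh)
        {
            return sh->generation != conf->generation;
        }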
     
  • When reducing the number of devices in a raid4/5/6, the reshape
    process has to start at the end of the array and work down to the
    beginning. So we need to handle expand_progress and expand_lo
    differently.

    This patch renames "expand_progress" and "expand_lo" to avoid the
    implication that anything is getting bigger (expand->reshape) and
    every place they are used, we make sure that they are used the right
    way depending on whether delta_disks is positive or negative.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We now have this value in stripe_head so we don't need to duplicate
    it.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Move the raid6 data processing routines into a standalone module
    (raid6_pq) to prepare them to be called from async_tx wrappers and other
    non-md drivers/modules. This precludes a circular dependency of raid456
    needing the async modules for data processing while those modules in
    turn depend on raid456 for the base level synchronous raid6 routines.

    To support this move:
    1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
    2/ The raid6_call, recovery calls, and table symbols are exported
    3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
    compile

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown

    Dan Williams