19 Aug, 2014

4 commits

  • Most places which allocate an r10_bio zero ->state, but some don't.
    As the r10_bio comes from a mempool, and the allocation function uses
    kzalloc, it is often zero anyway. But sometimes it isn't, and it is
    best to be safe.

    I only noticed this because of the bug fixed by an earlier patch
    where the r10_bios allocated for a reshape were left around to
    be used by a subsequent resync. In that case the R10BIO_IsReshape
    flag caused problems.

    Signed-off-by: NeilBrown

    NeilBrown
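The defensive pattern can be sketched in plain userspace C. The struct and function names below are invented for illustration; the point is only that an object handed out from a pool resets its per-use state itself rather than trusting the allocator:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for struct r10bio: just the per-use state word. */
struct r10bio_demo {
    unsigned long state;   /* flag bits such as R10BIO_IsReshape */
    int sectors;
};

/* Pool objects may be recycled with stale flags from a previous user,
 * so the allocation path clears ->state explicitly instead of relying
 * on the underlying allocator having zeroed the memory. */
struct r10bio_demo *get_from_pool(struct r10bio_demo *recycled)
{
    struct r10bio_demo *r10_bio =
        recycled ? recycled : calloc(1, sizeof(*r10_bio));
    if (!r10_bio)
        return NULL;
    r10_bio->state = 0;   /* the fix: always zero, recycled or fresh */
    return r10_bio;
}
```

Recycling an object with a stale flag set then yields one whose state is zero again, which is the property the patch restores.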
     
  • If raid10 reshape fails to find somewhere to read a block
    from, it returns without freeing memory...

    Signed-off-by: NeilBrown

    NeilBrown
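The shape of this kind of fix is the classic goto-cleanup pattern, sketched below with invented names (nothing here is raid10.c's actual code):

```c
#include <assert.h>
#include <stdlib.h>

int leaks;   /* counts allocations not yet freed, for demonstration */

static int find_source_device(int have_source) { return have_source; }

/* Every early exit taken after the allocation must pass through the
 * cleanup label; the bug was a bare return on the failure path. */
int reshape_read_step(int have_source)
{
    int ret = -1;
    char *buf = malloc(4096);
    if (!buf)
        return -1;
    leaks++;

    if (!find_source_device(have_source))
        goto out_free;   /* abort, but still release buf below */

    /* ... read the block into buf ... */
    ret = 0;

out_free:
    free(buf);
    leaks--;
    return ret;
}
```

Both the success and the failure path leave the leak counter at zero.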
     
  • When a raid10 commences a resync/recovery/reshape it allocates
    some buffer space.
    When a resync/recovery completes the buffer space is freed. But not
    when the reshape completes.
    This can result in a small memory leak.

    There is a subtle side-effect of this bug. When a RAID10 is reshaped
    to a larger array (more devices), the reshape is immediately followed
    by a "resync" of the new space. This "resync" will use the buffer
    space which was allocated for "reshape". This can cause problems
    including a "BUG" in the SCSI layer. So this is suitable for -stable.

    Cc: stable@vger.kernel.org (v3.5+)
    Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6
    Signed-off-by: NeilBrown

    NeilBrown
     
  • raid10 reshape clears unwanted bits from a bio->bi_flags using
    a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC
    was added.
    Since then it clears that bit but shouldn't. This results in a
    memory leak.

    So change to use the approved method of clearing unwanted bits.

    As this causes a memory leak which can consume all of memory
    the fix is suitable for -stable.

    Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd
    Cc: stable@vger.kernel.org (v3.10+)
    Reported-by: mdraid.pkoch@dfgh.net (Peter Koch)
    Signed-off-by: NeilBrown

    NeilBrown
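The difference between the clumsy and the approved method can be shown with a toy flag word. The flag names and bit positions below are invented; the real ones live in the block layer:

```c
#include <assert.h>

/* Illustrative flag bits (positions invented for the example). */
#define BIO_UPTODATE_DEMO  (1u << 0)
#define BIO_SEG_VALID_DEMO (1u << 1)
#define BIO_OWNS_VEC_DEMO  (1u << 2)   /* must survive the reset */

/* Clumsy method: keep only the bits you remembered to list.  Any flag
 * added later (like BIO_OWNS_VEC) is silently cleared too. */
unsigned clear_flags_clumsy(unsigned flags)
{
    return flags & (BIO_UPTODATE_DEMO | BIO_SEG_VALID_DEMO);
}

/* Approved method: clear exactly the unwanted bits and leave the rest
 * alone, so newly introduced flags are preserved by default. */
unsigned clear_flags_masked(unsigned flags)
{
    return flags & ~BIO_SEG_VALID_DEMO;   /* drop only what we must */
}
```

Losing an "owns the vec" style bit is precisely how a memory leak appears: the owner flag is gone, so nobody frees the memory.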
     

31 Jul, 2014

1 commit

  • Currently we don't abort recovery on a write error if the write error
    to the recovering device was triggered by normal IO (as opposed to
    recovery IO).

    This means that for one bitmap region, the recovery might write to the
    recovering device for a few sectors, then not bother for subsequent
    sectors (as it never writes to failed devices). In this case
    the bitmap bit will be cleared, but it really shouldn't.

    The result is that if the recovering device fails and is then re-added
    (after fixing whatever hardware problem triggered the failure),
    the second recovery won't redo the region it was in the middle of,
    so some of the device will not be recovered properly.

    If we abort the recovery, the region being processed will be cancelled
    (bit not cleared) and the whole region will be retried.

    As the bug can result in data corruption the patch is suitable for
    -stable. For kernels prior to 3.11 there is a conflict in raid10.c
    which will require care.

    Original-from: jiao hui
    Reported-and-tested-by: jiao hui
    Signed-off-by: NeilBrown
    Cc: stable@vger.kernel.org

    NeilBrown
     

06 May, 2014

1 commit

  • wait_barrier() includes a counter, so we must call it precisely once
    (unless balanced by allow_barrier()) for each request submitted.

    Since
    commit 20d0189b1012a37d2533a87fb451f7852f2418d1
    block: Introduce new bio_split()
    in 3.14-rc1, we don't call it for the extra requests generated when
    we need to split a bio.

    When this happens the counter goes negative, any resync/recovery will
    never start, and "mdadm --stop" will hang.

    Reported-by: Chris Murphy
    Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
    Cc: stable@vger.kernel.org (3.14+)
    Cc: Kent Overstreet
    Signed-off-by: NeilBrown

    NeilBrown
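A toy model of the counter pairing described above, with invented structures (the real wait_barrier() also sleeps while a barrier is raised, which is omitted here):

```c
#include <assert.h>

/* nr_pending must be incremented exactly once per submitted request and
 * decremented exactly once at completion; the bug left the extra
 * requests created by a bio split uncounted, driving it negative. */
struct conf_demo { int nr_pending; };

void wait_barrier(struct conf_demo *conf)  { conf->nr_pending++; }
void allow_barrier(struct conf_demo *conf) { conf->nr_pending--; }

/* The fixed code calls wait_barrier() once per request that results
 * from splitting a bio. */
void submit_split(struct conf_demo *conf, int pieces)
{
    for (int i = 0; i < pieces; i++)
        wait_barrier(conf);
}

void complete_split(struct conf_demo *conf, int pieces)
{
    for (int i = 0; i < pieces; i++)
        allow_barrier(conf);
}
```

With the calls balanced the counter returns to zero after completion, which is the condition resync and "mdadm --stop" wait for.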
     

31 Jan, 2014

1 commit

  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_vec series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

14 Jan, 2014

3 commits

  • This is the raid10 equivalent of

    commit 4f0a5e012cf41321d611e7cad63e1017d143d138
    MD RAID1: Further conditionalize 'fullsync'

    If a device in a newly assembled array is not fully recovered we
    currently do a full resync, but we don't need to.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • commit e875ecea266a543e643b19e44cf472f1412708f9
    md/raid10 record bad blocks as needed during recovery.

    added code to the "cannot recover this block" path to record a bad
    block rather than fail the whole recovery.
    Unfortunately this new case was placed *after* r10bio was freed rather
    than *before*, yet it still uses r10bio.
    This will crash with a null dereference.

    So move the freeing of r10bio down where it is safe.

    Cc: stable@vger.kernel.org (v3.1+)
    Fixes: e875ecea266a543e643b19e44cf472f1412708f9
    Reported-by: Damian Nowak
    URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
    Signed-off-by: NeilBrown

    NeilBrown
     
  • If we discover a bad block when reading we split the request and
    potentially read some of it from a different device.

    The code path of this has two bugs in RAID10.
    1/ we get a spin_lock with _irq, but unlock without _irq!!
    2/ The calculation of 'sectors_handled' is wrong, as can be clearly
    seen by comparison with raid1.c

    This leads to at least 2 warnings and a probable crash if a RAID10
    ever had known bad blocks.

    Cc: stable@vger.kernel.org (v3.1+)
    Fixes: 856e08e23762dfb92ffc68fd0a8d228f9e152160
    Reported-by: Damian Nowak
    URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181
    Signed-off-by: NeilBrown

    NeilBrown
     

24 Nov, 2013

4 commits

  • The new bio_split() can split arbitrary bios - it's not restricted to
    single page bios, like the old bio_split() (previously renamed to
    bio_pair_split()). It also has different semantics - it doesn't allocate
    a struct bio_pair, leaving it up to the caller to handle completions.

    Then convert the existing bio_pair_split() users to the new bio_split()
    - and also nvme, which was open coding bio splitting.

    (We have to take that BUG_ON() out of bio_integrity_trim() because this
    bio_split() needs to use it, and there's no reason it has to be used on
    bios marked as cloned; BIO_CLONED doesn't seem to have clearly
    documented semantics anyways.)

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Martin K. Petersen
    Cc: Matthew Wilcox
    Cc: Keith Busch
    Cc: Vishal Verma
    Cc: Jiri Kosina
    Cc: Neil Brown

    Kent Overstreet
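The new contract can be modelled in a few lines of userspace C: the helper returns the front piece and shrinks the original in place, and no bio_pair is allocated, so completions stay with the caller. All names below are illustrative:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a bio: a starting sector and a length. */
struct bio_demo { int sector; int sectors; };

/* Split off the first 'sectors' sectors.  The remainder advances and
 * shrinks in place; the caller owns completion of both pieces. */
struct bio_demo *bio_split_demo(struct bio_demo *bio, int sectors)
{
    struct bio_demo *front = malloc(sizeof(*front));
    if (!front)
        return NULL;
    front->sector  = bio->sector;
    front->sectors = sectors;
    bio->sector  += sectors;
    bio->sectors -= sectors;
    return front;
}
```

Because nothing restricts the split point to a page boundary, this models splitting arbitrary bios, unlike the old single-page bio_split().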
     
  • This is prep work for introducing a more general bio_split().

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: NeilBrown
    Cc: Alasdair Kergon
    Cc: Lars Ellenberg
    Cc: Peter Osterlund
    Cc: Sage Weil

    Kent Overstreet
     
  • When we start sharing biovecs, keeping bi_vcnt accurate for splits is
    going to be error prone - and unnecessary, if we refactor some code.

    So bio_segments() has to go - but most of the existing users just needed
    to know if the bio had multiple segments, which is easier - add a
    bio_multiple_segments() for them.

    (Two of the current uses of bio_segments() are going to go away in a
    couple patches, but the current implementation of bio_segments() is
    unsafe as soon as we start doing driver conversions for immutable
    biovecs - so implement a dumb version for bisectability, it'll go away
    in a couple patches)

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Neil Brown
    Cc: Nagalakshmi Nandigama
    Cc: Sreekanth Reddy
    Cc: "James E.J. Bottomley"

    Kent Overstreet
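A sketch of the interface change with a toy segment array (the kernel's representation is a biovec, not an int array):

```c
#include <assert.h>
#include <stdbool.h>

/* Full count: what bio_segments() used to provide.  Under immutable
 * biovecs, computing this by walking shared vecs becomes unsafe. */
int bio_segments_demo(const int *seg_lens, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (seg_lens[i] > 0)
            count++;
    return count;
}

/* Most callers only asked "more than one segment?", which is the
 * easier question bio_multiple_segments() answers. */
bool bio_multiple_segments_demo(const int *seg_lens, int n)
{
    return bio_segments_demo(seg_lens, n) > 1;
}
```

Converting callers to the boolean form removes their dependence on an exact count that will no longer be cheap or safe to compute.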
     
  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

21 Nov, 2013

1 commit

  • Pull md update from Neil Brown:
    "Mostly optimisations and obscure bug fixes.
    - raid5 gets less lock contention
    - raid1 gets less contention between normal-io and resync-io during
    resync"

    * tag 'md/3.13' of git://neil.brown.name/md:
    md/raid5: Use conf->device_lock protect changing of multi-thread resources.
    md/raid5: Before freeing old multi-thread worker, it should flush them.
    md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
    UAPI: include in linux/raid/md_p.h
    raid1: Rewrite the implementation of iobarrier.
    raid1: Add some macros to make code clearly.
    raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
    raid1: Add a field array_frozen to indicate whether raid in freeze state.
    md: Convert use of typedef ctl_table to struct ctl_table
    md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
    md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
    md: fix some places where mddev_lock return value is not checked.
    raid5: Retry R5_ReadNoMerge flag when hit a read error.
    raid5: relieve lock contention in get_active_stripe()
    raid5: relieve lock contention in get_active_stripe()
    wait: add wait_event_cmd()
    md/raid5.c: add proper locking to error path of raid5_start_reshape.
    md: fix calculation of stacking limits on level change.
    raid5: Use slow_path to release stripe when mddev->thread is null

    Linus Torvalds
     

19 Nov, 2013

1 commit

  • We currently use kthread_should_stop() in various places in the
    sync/reshape code to abort early.
    However some places set MD_RECOVERY_INTR but don't immediately call
    md_reap_sync_thread() (and we will shortly get another one).
    When this happens we are relying on md_check_recovery() to reap the
    thread, and that only happens when the thread finishes normally.
    So MD_RECOVERY_INTR must lead to a normal finish without the
    kthread_should_stop() test.

    So replace all relevant tests, and be more careful when the thread is
    interrupted not to acknowledge the latest step in a reshape, as it may
    not be fully committed yet.

    Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
    so we don't have to wait for the speed to drop before we can abort.

    Signed-off-by: NeilBrown

    NeilBrown
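The control-flow change can be modelled in userspace: the loop checks an interrupt flag and breaks out through its normal exit path, so cleanup always runs. Flag value and names below are invented:

```c
#include <assert.h>

#define MD_RECOVERY_INTR_DEMO (1u << 0)

struct mddev_demo { unsigned recovery; int cleaned_up; };

/* Instead of bailing out via an external "should stop" signal, the
 * loop tests the INTR flag and falls through to its normal finish,
 * so the usual reaping/cleanup always happens. */
int sync_loop(struct mddev_demo *mddev, int steps)
{
    int done = 0;
    for (int i = 0; i < steps; i++) {
        if (mddev->recovery & MD_RECOVERY_INTR_DEMO)
            break;               /* abort, but reach cleanup below */
        done++;
        if (done == 2)           /* simulate an interruption mid-way */
            mddev->recovery |= MD_RECOVERY_INTR_DEMO;
    }
    mddev->cleaned_up = 1;       /* normal finish path always reached */
    return done;
}
```

An interrupted run stops early yet still executes the same finish code as a completed run, which is what reaping relies on.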
     

24 Oct, 2013

1 commit

  • Since:
    commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86
    md: Allow devices to be re-added to a read-only array.

    spares are activated on a read-only array. In the case of the raid1 and
    raid10 personalities this causes not-in-sync devices to be marked
    in-sync without checking whether recovery has finished.

    If a read-only array is degraded and one of its devices is not in-sync
    (because the array has been only partially recovered) recovery will be skipped.

    This patch adds a check that recovery has finished before marking a device
    in-sync for the raid1 and raid10 personalities. For the raid5 personality
    such a condition is already present (at raid5.c:6029).

    Bug was introduced in 3.10 and causes data corruption.

    Cc: stable@vger.kernel.org
    Signed-off-by: Pawel Baldysiak
    Signed-off-by: Lukasz Dorau
    Signed-off-by: NeilBrown

    Lukasz Dorau
     

25 Jul, 2013

1 commit

  • We always need to be careful when calling generic_make_request, as it
    can start a chain of events which might free something that we are
    using.

    Here is one place I wasn't careful enough. If the wbio2 is not in
    use, then it might get freed at the first generic_make_request call.
    So perform all necessary tests first.

    This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an
    oops, so the fix is suitable for any -stable since then.

    Cc: stable@vger.kernel.org (3.3+)
    Signed-off-by: NeilBrown

    NeilBrown
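The hazard and the fix ordering can be sketched with a toy submission function that may free its argument immediately. Everything below is invented for illustration; the point is that every test on a request happens before the first hand-off:

```c
#include <assert.h>
#include <stdlib.h>

struct wbio_demo { int in_use; };

int freed;   /* counts completed (freed) requests, for demonstration */

/* Submission may run the whole completion chain synchronously, freeing
 * the request (and possibly related requests) before it returns. */
void submit_demo(struct wbio_demo *w)
{
    free(w);
    freed++;
}

/* Fixed ordering: decide what to do with wbio2 *before* submitting
 * wbio, because submitting wbio can start the chain of events that
 * frees wbio2. */
void submit_pair_fixed(struct wbio_demo *wbio, struct wbio_demo *wbio2)
{
    int wbio2_in_use = wbio2 && wbio2->in_use;   /* all tests first */
    submit_demo(wbio);                           /* may free things  */
    if (wbio2_in_use)
        submit_demo(wbio2);
    else if (wbio2)
        free(wbio2);
}
```

No field of either request is read after its hand-off, which is the discipline the commit message describes.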
     

18 Jul, 2013

1 commit

  • 1/ When a difference between blocks is found, data is copied from
    one bio to the other. However bv_len is used as the length to
    copy and this could be zero. So use r10_bio->sectors to calculate
    length instead.
    Using bv_len was probably always a bit dubious, but the introduction
    of bio_advance made it much more likely to be a problem.

    2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
    except on bios that we schedule for a read. This ensures that
    missing/failed devices don't confuse the loop at the top of
    sync_request write.
    Commit 8be185f2c9d54d6 "raid10: Use bio_reset()"
    removed a loop which set BIO_UPTODATE on all appropriate bios.
    So we need to re-add that flag.

    These bugs were introduced in 3.10, so this patch is suitable for
    3.10-stable, and can remove a potential for data corruption.

    Cc: stable@vger.kernel.org (3.10)
    Reported-by: Brassow Jonathan
    Signed-off-by: NeilBrown

    NeilBrown
     

04 Jul, 2013

1 commit

  • The recent commit:
    commit 7e83ccbecd608b971f340e951c9e84cd0343002f
    md/raid10: Allow skipping recovery when clean arrays are assembled

    causes raid10 to skip a recovery in certain cases where it is safe to
    do so. Unfortunately it also causes a reshape to be skipped, which is
    never safe. The result is that an attempt to reshape a RAID10 will
    appear to complete instantly, but no data will have been moved, so the
    array will now contain garbage.
    (If nothing is written, you can recover by simply performing the
    reverse reshape, which will also complete instantly).

    Bug was introduced in 3.10, so this is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Cc: Martin Wilck
    Signed-off-by: NeilBrown

    NeilBrown
     

03 Jul, 2013

1 commit

  • 1/ If a RAID10 is being reshaped to a smaller number of devices
    and is stopped while this is ongoing, then when the array is
    reassembled the 'mirrors' array will be allocated too small.
    This will lead to an access error or memory corruption.

    2/ A sanity test performed when a reshaping RAID10 array is restarted
    is slightly incorrect.

    Due to the first bug, this is suitable for any -stable
    kernel since 3.5 where this code was introduced.

    Cc: stable@vger.kernel.org (v3.5+)
    Signed-off-by: NeilBrown

    NeilBrown
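A sketch of the sizing rule with invented field names: while the reshape is in flight both geometries are live, so allocate for the larger one:

```c
#include <assert.h>

/* Toy stand-in for a raid10 geometry. */
struct geom_demo { int raid_disks; };

/* The buggy assembly path sized the mirrors array from the new
 * geometry alone; with a shrinking reshape that is the smaller count,
 * so accesses through the old geometry ran off the end. */
int mirrors_needed(struct geom_demo prev, struct geom_demo geo)
{
    return prev.raid_disks > geo.raid_disks ? prev.raid_disks
                                            : geo.raid_disks;
}
```

With a 6-to-4 device reshape in progress this yields 6, leaving room for every device the old geometry can still reference.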
     

14 Jun, 2013

4 commits

  • It isn't really enough to check that the rdev is present, we need to
    also be sure that the device is still In_sync.

    Doing this requires using rcu_dereference to access the rdev, and
    holding the rcu_read_lock() to ensure the rdev doesn't disappear while
    we look at it.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • As 'enough' accesses conf->prev and conf->geo, which can change
    spontaneously, it should guard against changes.
    This can be done with device_lock as start_reshape holds device_lock
    while updating 'geo' and end_reshape holds it while updating 'prev'.

    So 'error' needs to hold 'device_lock'.

    On the other hand, raid10_end_read_request knows which of the two it
    really wants to access, and as it is an active request on that one,
    the value cannot change underneath it.

    So change _enough to take flag rather than a pointer, pass the
    appropriate flag from raid10_end_read_request(), and remove the locking.

    All other calls to 'enough' are made with reconfig_mutex held, so
    neither 'prev' nor 'geo' can change.

    Signed-off-by: NeilBrown

    NeilBrown
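The interface change can be sketched as follows; the "enough devices" rule here is a toy, not raid10's real calculation:

```c
#include <assert.h>

struct geo_demo  { int raid_disks; };
struct conf_demo { struct geo_demo prev, geo; };

/* Instead of the caller passing a pointer to conf->prev or conf->geo
 * (which only stays meaningful while a lock is held), the function
 * takes a flag saying which geometry it should consult and reads it
 * itself. */
int enough_demo(struct conf_demo *conf, int previous, int working)
{
    struct geo_demo *g = previous ? &conf->prev : &conf->geo;
    return working >= g->raid_disks - 1;   /* toy redundancy rule */
}
```

A caller like raid10_end_read_request then passes the flag for the geometry its request belongs to, with no locking needed.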
     
  • DM RAID: Add ability to restore transiently failed devices on resume

    This patch adds code to the resume function to check over the devices
    in the RAID array. If any are found to be marked as failed and their
    superblocks can be read, an attempt is made to reintegrate them into
    the array. This allows the user to refresh the array with a simple
    suspend and resume of the array - rather than having to load a
    completely new table, allocate and initialize all the structures and
    throw away the old instantiation.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Pull md bugfixes from Neil Brown:
    "A few bugfixes for md

    Some tagged for -stable"

    * tag 'md-3.10-fixes' of git://neil.brown.name/md:
    md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
    md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
    md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
    md: md_stop_writes() should always freeze recovery.

    Linus Torvalds
     

13 Jun, 2013

3 commits

  • There are cases where the kernel will believe that the WRITE SAME
    command is supported by a block device which does not, in fact,
    support WRITE SAME. This currently happens for SATA drivers behind a
    SAS controller, but there are probably a hundred other ways that can
    happen, including drive firmware bugs.

    After receiving an error for WRITE SAME the block layer will retry the
    request as a plain write of zeroes, but mdraid will consider the
    failure as fatal and consider the drive failed. This has the effect
    that all the mirrors containing a specific set of data are each
    offlined in very rapid succession resulting in data loss.

    However, just bouncing the request back up to the block layer isn't
    ideal either, because the whole initial request-retry sequence should
    be inside the write bitmap fence, which probably means that md needs
    to do its own conversion of WRITE SAME to write zero.

    Until the failure scenario has been sorted out, disable WRITE SAME for
    raid1, raid5, and raid10.

    [neilb: added raid5]

    This patch is appropriate for any -stable since 3.7 when write_same
    support was added.

    Cc: stable@vger.kernel.org
    Signed-off-by: H. Peter Anvin
    Signed-off-by: NeilBrown

    H. Peter Anvin
     
  • Various places in raid1 and raid10 are calling raise_barrier when they
    really should call freeze_array.
    The former is only intended to be called from "make_request".
    The latter has extra checks for 'nr_queued' and makes a call to
    flush_pending_writes(), so it is safe to call it from within the
    management thread.

    Using raise_barrier will sometimes deadlock. Using freeze_array
    should not.

    As 'freeze_array' currently expects one request to be pending (in
    handle_read_error - the only previous caller), we need to pass
    it the number of pending requests (extra) to ignore.

    The deadlock was made particularly noticeable by commits
    050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
    appeared in 3.4, so the fix is appropriate for any -stable
    kernel since then.

    This patch probably won't apply directly to some early kernels and
    will need to be applied by hand.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
     
  • md/raid1: consider WRITE as successful only if at least one non-Faulty
    and non-rebuilding drive completed it.

    Without that fix, the following scenario could happen:

    - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
    - Drive A fails
    - WRITE request arrives to the array. It is failed by drive A, so
    r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
    succeeds in writing it, so the same r1_bio is marked as
    R1BIO_Uptodate.
    - r1_bio arrives to handle_write_finished, badblocks are disabled,
    md_error()->error() does nothing because we don't fail the last drive
    of raid1
    - raid_end_bio_io() calls call_bio_endio()
    - As a result, in call_bio_endio():
    if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
    clear_bit(BIO_UPTODATE, &bio->bi_flags);
    this code doesn't clear the BIO_UPTODATE flag, and the whole master
    WRITE succeeds, back to the upper layer.

    So we returned success to the upper layer, even though we had written
    the data onto the rebuilding drive only. But when we want to read the
    data back, we would not read from the rebuilding drive, so this data
    is lost.

    [neilb - applied identical change to raid10 as well]

    This bug can result in lost data, so it is suitable for any
    -stable kernel.

    Cc: stable@vger.kernel.org
    Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    Alex Lyakas
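The rule from this commit can be modelled directly: a write is reported successful only if some drive that is neither Faulty nor rebuilding completed it. The structures below are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>

struct drive_demo { bool faulty, rebuilding, write_ok; };

/* A durable copy exists only on a fully in-sync drive, so a write to a
 * rebuilding drive alone must not count as success. */
bool write_successful(const struct drive_demo *d, int n)
{
    for (int i = 0; i < n; i++)
        if (d[i].write_ok && !d[i].faulty && !d[i].rebuilding)
            return true;
    return false;
}
```

In the scenario from the commit message (drive A failed the write, rebuilding drive B succeeded), this correctly reports the WRITE as failed instead of claiming success for data that only exists on the rebuilding drive.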
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kents prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejuns changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework, SCSI patches exists on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

24 Apr, 2013

1 commit

  • When an array is assembled incrementally with mdadm -I -R
    and the array switches to "active" mode, md starts a recovery.

    If the array was clean, the "fullsync" flag will be 0. Skip
    the full recovery in this case, as RAID1 does (the code was
    actually copied from the sync_request() method of RAID1).

    Signed-off-by: Martin Wilck
    Signed-off-by: NeilBrown

    Martin Wilck
     

24 Mar, 2013

4 commits

  • More prep work for immutable bio vecs, mainly getting rid of references
    to bi_idx.

    bio_reset was being open coded in a few places. The one in sync_request
    was a bit nontrivial to convert, so could use some extra eyeballs.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    Acked-by: NeilBrown

    Kent Overstreet
     
  • Random cleanup - this code was duplicated and it's not really specific
    to md.

    Also added the ability to return the actual error code.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    Acked-by: Tejun Heo

    Kent Overstreet
     
  • For immutable bvecs, all bi_idx usage needs to be audited - so here
    we're removing all the unnecessary uses.

    Most of these are places where it was being initialized on a bio that
    was just allocated, a few others are conversions to standard macros.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

    Kent Overstreet
     
  • In the current code bio_split() won't be seeing partially completed bios
    so this doesn't change any behaviour, but this makes the code a bit
    clearer as to what bio_split() actually requires.

    The immediate purpose of the patch is removing unnecessary bi_idx
    references, but the end goal is to allow partial completed bios to be
    submitted, which along with immutable biovecs enables efficient bio
    splitting.

    Some of the callers were (double) checking that bios could be split, so
    update their checks too.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Neil Brown
    CC: Martin K. Petersen

    Kent Overstreet