09 Apr, 2014

1 commit

  • When performing a user-requested check/repair (MD_RECOVERY_REQUEST is set)
    on a raid1, we allocate multiple bios, each with their own set of pages.

    If the page allocation for one bio fails, we currently do *not* free
    the pages allocated for the previous bios, nor do we free the bio itself.

    This patch frees all the already-allocated pages, and makes sure that
    all the bios are freed as well.

    This bug can cause a memory leak which can ultimately OOM a machine.
    It was introduced in 3.10-rc1.
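
    The shape of the fix is the usual unwind-on-error pattern; here is a
    minimal user-space model of it (names and sizes are illustrative, not
    the kernel code):

        /* Model of the fix: on any page-allocation failure, free the
         * pages of every bio allocated so far, then give up cleanly. */
        #include <stdlib.h>

        #define PAGES_PER_BIO 16

        struct fake_bio { void *pages[PAGES_PER_BIO]; };

        static struct fake_bio *alloc_check_bios(int nbios)
        {
                struct fake_bio *bios = calloc(nbios, sizeof(*bios));
                int i, j;

                if (!bios)
                        return NULL;
                for (i = 0; i < nbios; i++)
                        for (j = 0; j < PAGES_PER_BIO; j++) {
                                bios[i].pages[j] = malloc(4096);
                                if (!bios[i].pages[j])
                                        goto unwind;
                        }
                return bios;
        unwind:
                /* free the partial bio, then every earlier bio in full */
                do {
                        while (j-- > 0)
                                free(bios[i].pages[j]);
                        j = PAGES_PER_BIO;
                } while (i-- > 0);
                free(bios);
                return NULL;
        }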

    Fixes: a07876064a0b73ab5ef1ebcf14b1cf0231c07858
    Cc: Kent Overstreet
    Cc: stable@vger.kernel.org (3.10+)
    Reported-by: Russell King - ARM Linux
    Signed-off-by: NeilBrown

    NeilBrown
     

05 Feb, 2014

1 commit

  • commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
    md/raid1: fix bio handling problems in process_checks()

    Moved the bio_reset() to a point before where BIO_UPTODATE is checked,
    so that the check now always reports that the bio is uptodate, even if
    it is not.

    This causes process_checks() to sometimes treat read errors as
    successful matches, so the good data isn't written out.

    This patch preserves the flag until it is needed.
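
    The fix amounts to sampling the flag before the reset and re-clearing
    it afterwards; roughly (a sketch, relying on bio_reset() in these
    kernels setting BIO_UPTODATE unconditionally):

        int uptodate = test_bit(BIO_UPTODATE, &b->bi_flags);

        bio_reset(b);                   /* re-sets BIO_UPTODATE */
        if (!uptodate)
                clear_bit(BIO_UPTODATE, &b->bi_flags);
        /* later checks still see the original read error */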

    Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
    an even worse bug). So suitable for any -stable since 3.10.

    Reported-and-tested-by: Michael Tokarev
    Cc: stable@vger.kernel.org (3.10+)
    Fixes: 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
    Signed-off-by: NeilBrown

    NeilBrown
     

31 Jan, 2014

1 commit

  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

14 Jan, 2014

1 commit

  • The new iobarrier implementation in raid1 (which keeps normal writes
    and resync activity separate) counts every request that is not before
    the current resync point in either next_window_requests or
    current_window_requests.
    It flags that the request is counted by setting ->start_next_window.

    allow_barrier follows this model exactly and decrements one of the
    *_window_requests if and only if ->start_next_window is set.

    However wait_barrier(), which increments *_window_requests, uses a
    slightly different test for setting ->start_next_window (which is set
    from the return value of this function).
    So there is a possibility of the counts getting out of sync, and this
    leads to the resync hanging.

    So change wait_barrier() to return a non-zero value in exactly the
    same cases that it increments *_window_requests.
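
    As a sketch of the contract between the two functions (simplified;
    the details of the bodies are elided):

        sector_t window = wait_barrier(conf, bio);
        /* non-zero iff a *_window_requests counter was incremented */
        r1_bio->start_next_window = window;

        /* ... the request runs ... */

        allow_barrier(conf, r1_bio->start_next_window, bio->bi_sector);
        /* decrements a *_window_requests counter iff
         * r1_bio->start_next_window != 0, keeping the counts balanced */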

    Bug was introduced in 3.13-rc1.

    Reported-by: Bruno Wolff III
    URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061
    Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
    Cc: majianpeng
    Signed-off-by: NeilBrown

    NeilBrown
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.
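
    For reference, this is roughly the shape the iterator takes once the
    series lands (bi_bvec_done arrives in the later patch):

        struct bvec_iter {
                sector_t        bi_sector;    /* device address, in
                                                 512-byte sectors */
                unsigned int    bi_size;      /* residual I/O count */
                unsigned int    bi_idx;       /* current index into
                                                 bi_io_vec */
                unsigned int    bi_bvec_done; /* bytes completed in the
                                                 current bvec */
        };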

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

21 Nov, 2013

1 commit

  • Pull md update from Neil Brown:
    "Mostly optimisations and obscure bug fixes.
    - raid5 gets less lock contention
    - raid1 gets less contention between normal-io and resync-io during
    resync"

    * tag 'md/3.13' of git://neil.brown.name/md:
    md/raid5: Use conf->device_lock protect changing of multi-thread resources.
    md/raid5: Before freeing old multi-thread worker, it should flush them.
    md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
    UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
    raid1: Rewrite the implementation of iobarrier.
    raid1: Add some macros to make code clearly.
    raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
    raid1: Add a field array_frozen to indicate whether raid in freeze state.
    md: Convert use of typedef ctl_table to struct ctl_table
    md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
    md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
    md: fix some places where mddev_lock return value is not checked.
    raid5: Retry R5_ReadNoMerge flag when hit a read error.
    raid5: relieve lock contention in get_active_stripe()
    raid5: relieve lock contention in get_active_stripe()
    wait: add wait_event_cmd()
    md/raid5.c: add proper locking to error path of raid5_start_reshape.
    md: fix calculation of stacking limits on level change.
    raid5: Use slow_path to release stripe when mddev->thread is null

    Linus Torvalds
     

19 Nov, 2013

4 commits

  • There is an iobarrier in raid1 because of contention between normal IO and
    resync IO. It suspends all normal IO when resync/recovery happens.

    However if normal IO is outside the resync window, there is no contention.
    So this patch changes the barrier mechanism to only block IO that
    could contend with the resync that is currently happening.

    We partition the whole space into five parts.
    |---------|-----------|------------|----------------|-------|
            start    next_resync  start_next_window  end_window

    start + RESYNC_WINDOW = next_resync
    next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
    start_next_window + NEXT_NORMALIO_DISTANCE = end_window

    Firstly we introduce some concepts:

    1 - RESYNC_WINDOW: For resync, there are at most 32 resync requests at
    the same time. Each resync request is RESYNC_BLOCK_SIZE (64*1024) bytes.
    So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
    2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
    and start_next_window. It also indicates the distance between
    start_next_window and end_window.
    It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
    this turned out not to be optimal.
    3 - next_resync: the next sector at which we will do sync IO.
    4 - start: a position which is at most RESYNC_WINDOW before
    next_resync.
    5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE
    beyond next_resync. Normal-io after this position doesn't need to
    wait for resync-io to complete.
    6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
    next_resync. This also doesn't need to wait, but is counted
    differently.
    7 - current_window_requests: the count of normalIO between
    start_next_window and end_window.
    8 - next_window_requests: the count of normalIO after end_window.
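
    In code, the geometry above reduces to a few constants; the following
    sketch is consistent with the description (the in-tree names may
    differ slightly):

        #define RESYNC_BLOCK_SIZE      (64 * 1024)  /* one resync request */
        #define RESYNC_DEPTH           32           /* max in-flight
                                                       resync requests */
        #define RESYNC_WINDOW          (RESYNC_BLOCK_SIZE * RESYNC_DEPTH)
                                                    /* = 2MB */
        #define RESYNC_WINDOW_SECTORS  (RESYNC_WINDOW >> 9)
        #define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)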

    NormalIO will be partitioned into four types:

    NormIO1: the end sector of the bio is smaller than or equal to start.
    NormIO2: the start sector of the bio is larger than or equal to
    end_window.
    NormIO3: the start sector of the bio is larger than or equal to
    start_next_window (but before end_window).
    NormIO4: the rest - bios overlapping the region between start and
    start_next_window (see the diagram below).

    |---------|-------------|-------------------|--------------|---------|
          start       next_resync     start_next_window    end_window
      NormIO1    NormIO4         NormIO4            NormIO3     NormIO2

    For NormIO1, we don't need any io barrier.
    For NormIO4, we used a similar approach to the original iobarrier
    mechanism. The normalIO and resyncIO must be kept separate.
    For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
    and "next_window_requests". They indicate the count of active
    requests in the two windows.
    For these, we don't wait for resync io to complete.
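
    Putting the four classes together, a small user-space model of the
    classification (illustrative only; the kernel folds this logic into
    wait_barrier()):

        typedef unsigned long long sector_t;

        enum normio { NORMIO1, NORMIO2, NORMIO3, NORMIO4 };

        /* The bio spans [bio_start, bio_end); window positions as in
         * the diagram above. */
        static enum normio classify(sector_t bio_start, sector_t bio_end,
                                    sector_t start,
                                    sector_t start_next_window,
                                    sector_t end_window)
        {
                if (bio_end <= start)
                        return NORMIO1; /* no barrier needed */
                if (bio_start >= end_window)
                        return NORMIO2; /* counted in next_window_requests */
                if (bio_start >= start_next_window)
                        return NORMIO3; /* counted in current_window_requests */
                return NORMIO4;         /* must wait for resync, as before */
        }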

    For the resync action, if there are NormIO4s, we must wait for them.
    If not, we can proceed.
    But if the resync action reaches start_next_window and
    current_window_requests > 0 (that is, there are NormIO3s), we must
    wait until current_window_requests becomes zero.
    When current_window_requests becomes zero, start_next_window also
    moves forward. Then current_window_requests will be replaced by
    next_window_requests.

    There is the question of when and how a request changes from NormIO2
    to NormIO3; only then can the sync action progress.

    We add a field in struct r1conf "start_next_window".

    A: if start_next_window == MaxSector, it means there are no NormIO2/3.
    So start_next_window = next_resync + NEXT_NORMALIO_DISTANCE.
    B: if current_window_requests == 0 && next_window_requests != 0, it
    means start_next_window moves to end_window.

    There is another problem: how to differentiate between an old NormIO2
    (which is now NormIO3) and a new NormIO2.
    For example, suppose there are many NormIO2 bios and one NormIO3 bio.
    The NormIO3 completes first, so the NormIO2 bios become NormIO3.

    We add a field in struct r1bio "start_next_window".
    This is used to record the position conf->start_next_window when the call
    to wait_barrier() is made in make_request().

    In allow_barrier(), we check conf->start_next_window.
    If r1bio->start_next_window == conf->start_next_window, it means
    there was no transition between NormIO2 and NormIO3.
    If r1bio->start_next_window != conf->start_next_window, it means
    there was a transition between NormIO2 and NormIO3. There can only
    have been one transition, so the bio must be an old NormIO2.
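
    The resulting check in allow_barrier() comes out roughly as
    (simplified sketch):

        if (r1_bio->start_next_window) {
                if (r1_bio->start_next_window == conf->start_next_window) {
                        /* no window transition since wait_barrier() */
                        if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
                            <= bi_sector)
                                conf->next_window_requests--;    /* NormIO2 */
                        else
                                conf->current_window_requests--; /* NormIO3 */
                } else
                        /* the window moved underneath us: an old NormIO2,
                         * now counted in the current window */
                        conf->current_window_requests--;
        }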

    For one bio, there may be many r1bios. So we make sure all the
    r1bio->start_next_window values are the same.
    If we meet a blocked_dev in make_request(), we must call allow_barrier
    and then wait_barrier again, so the value of conf->start_next_window
    may change between the two calls.
    If there were many r1bios with different start_next_window values, the
    accounting for the parent bio would depend on the last r1bio's value,
    which would cause errors. To avoid this, we wait for previous r1bios
    to complete.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • In a subsequent patch, we'll use some const parameters.
    Using macros will make the code clearer.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • … reconfiguring the array.

    We used to use raise_barrier to suspend normal IO while we reconfigure
    the array. However raise_barrier will soon only suspend some normal
    IO, not all. So we need something else.
    Change it to use freeze_array.
    But freeze_array not only suspends normal io, it also suspends
    resync io.
    For the places where raise_barrier was called in order to reconfigure
    the array, that isn't a problem.

    Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    majianpeng
     
  • The following patch will rewrite the interaction between normal IO
    and resync IO, so we add a field to indicate whether the raid array
    is in the frozen state.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     

24 Oct, 2013

1 commit

  • Since:
    commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86
    md: Allow devices to be re-added to a read-only array.

    spares are activated on a read-only array. In the case of the raid1 and
    raid10 personalities this causes not-in-sync devices to be marked
    in-sync without checking whether recovery has finished.

    If a read-only array is degraded and one of its devices is not in-sync
    (because the array has been only partially recovered) recovery will be skipped.

    This patch adds checking if recovery has been finished before marking a device
    in-sync for raid1 and raid10 personalities. In case of raid5 personality
    such condition is already present (at raid5.c:6029).
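
    The added condition is small; schematically, for raid1's
    spare-activation path (a sketch, assuming the usual md field names):

        /* Only promote the device to In_sync if its recovery actually
         * completed, never merely because the array is read-only. */
        if (rdev
            && rdev->recovery_offset == MaxSector
            && !test_bit(Faulty, &rdev->flags)
            && !test_and_set_bit(In_sync, &rdev->flags)) {
                count++;
                sysfs_notify_dirent_safe(rdev->sysfs_state);
        }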

    Bug was introduced in 3.10 and causes data corruption.

    Cc: stable@vger.kernel.org
    Signed-off-by: Pawel Baldysiak
    Signed-off-by: Lukasz Dorau
    Signed-off-by: NeilBrown

    Lukasz Dorau
     

18 Jul, 2013

1 commit

  • The recent change to use bio_copy_data() in raid1 when repairing
    an array is faulty.

    The underlying device may have changed the bio in various ways using
    bio_advance, and these changes need to be undone not just for the
    'sbio' which is being copied to, but also for the 'pbio' (primary)
    which is being copied from.

    So perform the reset on all bios that were read from, and do it early.

    This also ensures that the sbio->bi_io_vec[j].bv_len passed to
    memcmp is correct.

    This fixes a crash during a 'check' of a RAID1 array. The crash was
    introduced in 3.10 so this is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Reported-by: Joe Lawrence
    Signed-off-by: NeilBrown

    NeilBrown
     

14 Jun, 2013

2 commits

  • DM RAID: Add ability to restore transiently failed devices on resume

    This patch adds code to the resume function to check over the devices
    in the RAID array. If any are found to be marked as failed and their
    superblocks can be read, an attempt is made to reintegrate them into
    the array. This allows the user to refresh the array with a simple
    suspend and resume of the array - rather than having to load a
    completely new table, allocate and initialize all the structures and
    throw away the old instantiation.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Pull md bugfixes from Neil Brown:
    "A few bugfixes for md

    Some tagged for -stable"

    * tag 'md-3.10-fixes' of git://neil.brown.name/md:
    md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
    md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
    md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
    md: md_stop_writes() should always freeze recovery.

    Linus Torvalds
     

13 Jun, 2013

3 commits

  • There are cases where the kernel will believe that the WRITE SAME
    command is supported by a block device which does not, in fact,
    support WRITE SAME. This currently happens for SATA drives behind a
    SAS controller, but there are probably a hundred other ways that can
    happen, including drive firmware bugs.

    After receiving an error for WRITE SAME the block layer will retry the
    request as a plain write of zeroes, but mdraid will consider the
    failure as fatal and consider the drive failed. This has the effect
    that all the mirrors containing a specific set of data are each
    offlined in very rapid succession resulting in data loss.

    However, just bouncing the request back up to the block layer isn't
    ideal either, because the whole initial request-retry sequence should
    be inside the write bitmap fence, which probably means that md needs
    to do its own conversion of WRITE SAME to write zero.

    Until the failure scenario has been sorted out, disable WRITE SAME for
    raid1, raid5, and raid10.

    [neilb: added raid5]

    This patch is appropriate for any -stable since 3.7 when write_same
    support was added.

    Cc: stable@vger.kernel.org
    Signed-off-by: H. Peter Anvin
    Signed-off-by: NeilBrown

    H. Peter Anvin
     
  • Various places in raid1 and raid10 are calling raise_barrier when they
    really should call freeze_array.
    The former is only intended to be called from "make_request".
    The latter has extra checks for 'nr_queued' and makes a call to
    flush_pending_writes(), so it is safe to call it from within the
    management thread.

    Using raise_barrier will sometimes deadlock. Using freeze_array
    should not.

    As 'freeze_array' currently expects one request to be pending (in
    handle_read_error - the only previous caller), we need to pass
    it the number of pending requests (extra) to ignore.
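
    With the parameter added, the raid1 version looks roughly like this
    (sketch):

        static void freeze_array(struct r1conf *conf, int extra)
        {
                /* Wait until the 'extra' requests we already know about
                 * are the only ones still pending, flushing queued
                 * writes while we wait so that they can complete. */
                spin_lock_irq(&conf->resync_lock);
                conf->barrier++;
                conf->nr_waiting++;
                wait_event_lock_irq_cmd(conf->wait_barrier,
                                        conf->nr_pending ==
                                                conf->nr_queued + extra,
                                        conf->resync_lock,
                                        flush_pending_writes(conf));
                spin_unlock_irq(&conf->resync_lock);
        }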

    The deadlock was made particularly noticeable by commits
    050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
    appeared in 3.4, so the fix is appropriate for any -stable
    kernel since then.

    This patch probably won't apply directly to some early kernels and
    will need to be applied by hand.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
     
  • …ebuilding drive completed it.

    Without that fix, the following scenario could happen:

    - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
    - Drive A fails
    - WRITE request arrives to the array. It is failed by drive A, so
    r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
    succeeds in writing it, so the same r1_bio is marked as
    R1BIO_Uptodate.
    - r1_bio arrives to handle_write_finished, badblocks are disabled,
    md_error()->error() does nothing because we don't fail the last drive
    of raid1
    - raid_end_bio_io() calls call_bio_endio()
    - As a result, in call_bio_endio():
          if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                  clear_bit(BIO_UPTODATE, &bio->bi_flags);
      this code doesn't clear the BIO_UPTODATE flag, and the whole master
      WRITE succeeds, back to the upper layer.

    So we returned success to the upper layer, even though we had written
    the data onto the rebuilding drive only. But when we want to read the
    data back, we would not read from the rebuilding drive, so this data
    is lost.
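
    The fix narrows the success test in the write-completion path;
    schematically for raid1 (a sketch):

        /* In raid1_end_write_request(): only let this WRITE mark the
         * r1_bio up-to-date if the completing drive could also serve a
         * subsequent read of the data. */
        if (test_bit(In_sync, &rdev->flags) &&
            !test_bit(Faulty, &rdev->flags))
                set_bit(R1BIO_Uptodate, &r1_bio->state);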

    [neilb - applied identical change to raid10 as well]

    This bug can result in lost data, so it is suitable for any
    -stable kernel.

    Cc: stable@vger.kernel.org
    Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    Alex Lyakas
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang when merging discard bios exceeded the 32-bit
    unsigned rq->datalen.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

24 Mar, 2013

9 commits

  • More utility code to replace stuff that's getting open coded.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • More prep work for immutable bvecs:

    A few places in the code were either open coding or using the wrong
    version - fix.

    After we introduce the bvec iter, it'll no longer be possible to modify
    the biovec through bio_for_each_segment_all() - it doesn't increment a
    pointer to the current bvec, you pass in a struct bio_vec (not a
    pointer) which is updated with what the current biovec would be (taking
    into account bi_bvec_done and bi_size).

    So because of that it's more worthwhile to be consistent about
    bio_for_each_segment()/bio_for_each_segment_all() usage.
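
    After the conversion, a segment loop reads the current bvec by value;
    roughly (a sketch of the post-immutable API; process_page() is a
    stand-in):

        struct bio_vec bvec;   /* a copy, not a pointer into bi_io_vec */
        struct bvec_iter iter;

        bio_for_each_segment(bvec, bio, iter) {
                /* bvec is computed from the iterator (honouring
                 * bi_bvec_done and bi_size); modifying it does not
                 * touch the bio's own biovec */
                process_page(bvec.bv_page, bvec.bv_offset, bvec.bv_len);
        }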

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Alexander Viro

    Kent Overstreet
     
  • __bio_for_each_segment() iterates bvecs from the specified index
    instead of bio->bv_idx. Currently, the only usage is to walk all the
    bvecs after the bio has been advanced by specifying 0 index.

    For immutable bvecs, we need to split these apart;
    bio_for_each_segment() is going to have a different implementation.
    This will also help document the intent of code that's using it -
    bio_for_each_segment_all() is only legal to use for code that owns the
    bio.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Neil Brown
    CC: Boaz Harrosh

    Kent Overstreet
     
  • This doesn't really delete any code _yet_, but once immutable bvecs are
    done we can just delete the rest of the code in that loop.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • More bi_idx removal. This code was just open coding bio_clone(). This
    could probably be further improved by using bio_advance() instead of
    skipping over null pages, but that'd be a larger rework.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • Random cleanup - this code was duplicated and it's not really specific
    to md.

    Also added the ability to return the actual error code.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    Acked-by: Tejun Heo

    Kent Overstreet
     
  • Bunch of places in the code weren't using it where they could be -
    this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
    into a struct bvec_iter.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: "Ed L. Cashin"
    CC: Nick Piggin
    CC: Jiri Kosina
    CC: Jim Paris
    CC: Geoff Levand
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Steven Rostedt
    Acked-by: Ed Cashin

    Kent Overstreet
     
  • Just a little convenience macro - main reason to add it now is preparing
    for immutable bio vecs, it'll reduce the size of the patch that puts
    bi_sector/bi_size/bi_idx into a struct bvec_iter.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Jiri Kosina
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Martin Schwidefsky
    CC: Heiko Carstens
    CC: linux-s390@vger.kernel.org
    CC: Chris Mason
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse

    Kent Overstreet
     

26 Feb, 2013

2 commits

  • When raid1/raid10 needs to fix a read error, it first drains
    all pending requests by calling freeze_array().
    This calls flush_pending_writes() if it needs to sleep,
    but some writes may be pending in a per-process plug rather
    than in the per-array request queue.

    When raid1{,0}_unplug() moves the request from the per-process
    plug to the per-array request queue (from which
    flush_pending_writes() can flush them), it needs to wake up
    freeze_array(), or freeze_array() will never flush them and so
    it will block forever.

    So add the required wake_up() calls.
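
    Concretely, once the unplug path has moved the writes to the
    per-array queue it now wakes the waiter; roughly (sketch):

        spin_lock_irq(&conf->device_lock);
        bio_list_merge(&conf->pending_bio_list, &plug->pending);
        conf->pending_count += plug->pending_cnt;
        spin_unlock_irq(&conf->device_lock);
        wake_up(&conf->wait_barrier);   /* the added call: freeze_array()
                                         * re-checks and flushes these */
        md_wakeup_thread(mddev->thread);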

    This bug was introduced by commit
    f54a9d0e59c4bea3db733921ca9147612a6f292c
    for raid1 and a similar commit for RAID10, and so has been present
    since linux-3.6. As the bug causes a deadlock I believe this fix is
    suitable for -stable.

    Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y)
    Reported-by: Tregaron Bayly
    Tested-by: Tregaron Bayly
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Set mddev queue's max_write_same_sectors to its chunk_sector value (before
    disk_stack_limits merges the underlying disk limits). With that in place,
    be sure to handle writes coming down from the block layer that have the
    REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned
    write bio.

    Signed-off-by: Joe Lawrence
    Acked-by: "Martin K. Petersen"
    Signed-off-by: NeilBrown

    Joe Lawrence
     

18 Dec, 2012

1 commit

  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an event with a specific
    lock, allowing this to be pulled out of md; now loop and drbd are
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partition UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_even_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
     

30 Nov, 2012

1 commit

  • New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
    moves the private wait_event_lock_irq() macro from MD to the regular
    wait includes, introduces a new macro wait_event_lock_irq_cmd() instead
    of the old method of omitting the cmd parameter (which was ugly), and
    makes use of the new macros in MD. It also introduces the
    _interruptible_ variant.

    The new interface is for use when one has a special lock to protect
    data structures used in the condition, or when one also needs to
    invoke "cmd" before putting the process to sleep.

    All new macros are expected to be called with the lock taken. The lock
    is released before sleep and is reacquired afterwards. We will leave the
    macro with the lock held.
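
    Usage takes the following shape (a sketch; the condition and cmd
    shown here are illustrative, not a real call site):

        spin_lock_irq(&conf->device_lock);
        wait_event_lock_irq_cmd(conf->wait_barrier,
                                conf->nr_pending == 0,       /* condition */
                                conf->device_lock,           /* its lock  */
                                flush_pending_writes(conf)); /* cmd, run
                                                                before each
                                                                sleep */
        /* the lock is dropped across the sleep and held again here */
        spin_unlock_irq(&conf->device_lock);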

    Note to DM: IMO this should also fix a theoretical race on the
    waitqueue when simultaneously using wait_event_lock_irq() and
    wait_event(), because of the lack of locking around setting the
    current state and removing from the wait queue.

    Signed-off-by: Lukas Czerner
    Cc: Neil Brown
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

27 Nov, 2012

1 commit

  • If the raid1 or raid10 unplug function gets called
    from a make_request function (which is very possible) when
    there are bios on the current->bio_list list, then it will not
    be able to successfully call bitmap_unplug() and it could
    need to submit more bios and wait for them to complete.
    But they won't complete while current->bio_list is non-empty.

    So detect that case and hand the unplugging off to another thread
    just like we already do when called from within the scheduler.
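
    The detection is the same test already used for the scheduler case;
    roughly (sketch of the raid1 unplug path):

        if (from_schedule || current->bio_list) {
                /* Submitting and waiting here would deadlock: bios
                 * already on current->bio_list cannot complete until
                 * this function returns.  Queue the plugged writes for
                 * the md thread instead. */
                spin_lock_irq(&conf->device_lock);
                bio_list_merge(&conf->pending_bio_list, &plug->pending);
                conf->pending_count += plug->pending_cnt;
                spin_unlock_irq(&conf->device_lock);
                md_wakeup_thread(mddev->thread);
                kfree(plug);
                return;
        }
        /* otherwise it is safe to call bitmap_unplug() and submit here */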

    The RAID1 version of the bug was introduced in 3.6, so that part of the
    fix is suitable for 3.6.y. The RAID10 part won't apply.

    Cc: stable@vger.kernel.org
    Reported-by: Torsten Kaiser
    Reported-by: Peter Maloney
    Signed-off-by: NeilBrown

    NeilBrown
     

31 Oct, 2012

1 commit

  • setup_conf in raid1.c uses conf->raid_disks before assigning
    a value. It is used when including 'Replacement' devices.

    The consequence is that assembling an array which contains a
    replacement will misbehave and either not include the replacement, or
    not include the device being replaced.

    Though this doesn't lead directly to data corruption, it could lead to
    reduced data safety.

    So use mddev->raid_disks, which is initialised, instead.

    Bug was introduced by commit c19d57980b38a5bb613a898937a1cf85f422fb9b
    md/raid1: recognise replacements when assembling arrays.

    in 3.3, so fix is suitable for 3.3.y thru 3.6.y.

    Cc: stable@vger.kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

11 Oct, 2012

4 commits