31 Jul, 2014

1 commit

  • Currently we don't abort recovery on a write error if the write error
    to the recovering device was triggered by normal IO (as opposed to
    recovery IO).

    This means that for one bitmap region, the recovery might write to the
    recovering device for a few sectors, then not bother for subsequent
    sectors (as it never writes to failed devices). In this case
    the bitmap bit will be cleared, but it really shouldn't be.

    The result is that if the recovering device fails and is then re-added
    (after fixing whatever hardware problem triggered the failure),
    the second recovery won't redo the region it was in the middle of,
    so some of the device will not be recovered properly.

    If we abort the recovery, the region being processed will be cancelled
    (its bitmap bit will not be cleared) and the whole region will be
    retried. (A sketch of the idea follows this entry.)

    As the bug can result in data corruption the patch is suitable for
    -stable. For kernels prior to 3.11 there is a conflict in raid10.c
    which will require care.

    Original-from: jiao hui
    Reported-and-tested-by: jiao hui
    Signed-off-by: NeilBrown
    Cc: stable@vger.kernel.org

    NeilBrown
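
    A minimal sketch of the idea (assumed shape, not the literal upstream
    diff; MD_RECOVERY_INTR, In_sync and the rdev/conf names are real md
    identifiers, but the surrounding context is filled in here):

        /* In the write-completion path: a failed write to a device that
         * is still recovering interrupts the resync, so the current
         * bitmap region is retried instead of being left half-done with
         * its bitmap bit cleared. */
        if (!test_bit(In_sync, &rdev->flags))
                set_bit(MD_RECOVERY_INTR, &conf->mddev->recovery);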
     

09 Apr, 2014

1 commit

  • When performing a user-requested check/repair (MD_RECOVERY_REQUEST is
    set) on a raid1, we allocate multiple bios, each with its own set of
    pages.

    If the page allocation for one bio fails, we currently do *not* free
    the pages allocated for the previous bios, nor do we free the bio itself.

    This patch frees all the already-allocated pages, and makes sure that
    all the bios are freed as well. (A toy model of the unwind pattern
    follows this entry.)

    This bug can cause a memory leak which can ultimately OOM a machine.
    It was introduced in 3.10-rc1.

    Fixes: a07876064a0b73ab5ef1ebcf14b1cf0231c07858
    Cc: Kent Overstreet
    Cc: stable@vger.kernel.org (3.10+)
    Reported-by: Russell King - ARM Linux
    Signed-off-by: NeilBrown

    NeilBrown
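
    The shape of the fix is a classic unwind-on-failure. Here is a toy
    userspace model of the pattern (all names and sizes invented for
    illustration; this is not the kernel code):

        #include <stdlib.h>
        #include <string.h>

        #define NBIOS  4
        #define NPAGES 16

        /* If an allocation fails part-way through, unwind by freeing
         * every page already allocated for *all* earlier "bios" instead
         * of leaking them. */
        static int alloc_all_pages(void *pages[NBIOS][NPAGES])
        {
                memset(pages, 0, NBIOS * NPAGES * sizeof(void *));
                for (int b = 0; b < NBIOS; b++)
                        for (int p = 0; p < NPAGES; p++) {
                                pages[b][p] = malloc(4096);
                                if (!pages[b][p])
                                        goto unwind;
                        }
                return 0;

        unwind: /* the buggy version returned here without this loop */
                for (int b = 0; b < NBIOS; b++)
                        for (int p = 0; p < NPAGES; p++)
                                free(pages[b][p]);  /* free(NULL) is a no-op */
                return -1;
        }

        int main(void)
        {
                void *pages[NBIOS][NPAGES];
                /* toy driver; on success the pages are freed at exit */
                return alloc_all_pages(pages) ? 1 : 0;
        }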
     

05 Feb, 2014

1 commit

  • Commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
    (md/raid1: fix bio handling problems in process_checks())
    moved the bio_reset() to a point before where BIO_UPTODATE is checked,
    so that check now always reports that the bio is uptodate, even if it
    is not.

    This causes process_checks() to sometimes treat read errors as
    successful matches, so the good data isn't written out.

    This patch preserves the flag until it is needed (a sketch of the
    approach follows this entry).

    Bug was introduced in 3.11, but backported to 3.10-stable (as it fixed
    an even worse bug). So suitable for any -stable since 3.10.

    Reported-and-tested-by: Michael Tokarev
    Cc: stable@vger.kernel.org (3.10+)
    Fixes: 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
    Signed-off-by: NeilBrown

    NeilBrown
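
    A hedged sketch of the approach, with the surrounding context assumed
    rather than quoted from the diff: save the flag before bio_reset()
    clears it.

        /* bio_reset() wipes bi_flags, so test BIO_UPTODATE first and
         * keep the result for the comparison logic that runs later. */
        int uptodate = test_bit(BIO_UPTODATE, &sbio->bi_flags);

        bio_reset(sbio);

        if (!uptodate) {
                /* genuine read error: don't treat this bio as a good copy */
        }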
     

31 Jan, 2014

1 commit

  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

14 Jan, 2014

1 commit

  • The new iobarrier implementation in raid1 (which keeps normal writes
    and resync activity separate) counts every request that is not before
    the current resync point in either next_window_requests or
    current_window_requests.
    It flags that the request is counted by setting ->start_next_window.

    allow_barrier follows this model exactly and decrements one of the
    *_window_requests if and only if ->start_next_window is set.

    However wait_barrier(), which increments *_window_requests, uses a
    slightly different test for setting ->start_next_window (which is set
    from the return value of this function).
    So there is a possibility of the counts getting out of sync, and this
    leads to the resync hanging.

    So change wait_barrier() to return a non-zero value in exactly the
    same cases that it increments *_window_requests. (A toy model of this
    pairing invariant follows this entry.)

    Bug was introduced in 3.13-rc1.

    Reported-by: Bruno Wolff III
    URL: https://bugzilla.kernel.org/show_bug.cgi?id=68061
    Fixes: 79ef3a8aa1cb1523cc231c9a90a278333c21f761
    Cc: majianpeng
    Signed-off-by: NeilBrown

    NeilBrown
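
    A toy userspace model of the pairing invariant the fix restores (all
    names invented; this models the accounting, not the kernel code):

        #include <assert.h>
        #include <stdbool.h>

        /* wait_barrier() must return a non-zero token in *exactly* the
         * cases where it incremented a window counter, so allow_barrier()
         * (which decrements iff the token is set) keeps counts balanced. */
        static long window_requests;            /* stands in for *_window_requests */

        static long wait_barrier(bool counts)   /* counts = request is in a window */
        {
                if (!counts)
                        return 0;               /* no token -> no decrement later */
                window_requests++;
                return 1;                       /* token: ->start_next_window set */
        }

        static void allow_barrier(long token)
        {
                if (token)
                        window_requests--;
        }

        int main(void)
        {
                long t1 = wait_barrier(true), t2 = wait_barrier(false);
                allow_barrier(t1);
                allow_barrier(t2);
                assert(window_requests == 0);   /* balanced: resync cannot hang */
                return 0;
        }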
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things. (The rough shape of the new iterator is sketched after this
    entry.)

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
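
    For orientation, the explicit iterator this series introduces looks
    roughly like the sketch below (assembled from the description above;
    per the message, the bi_bvec_done member only arrives in a later
    patch):

        struct bvec_iter {
                sector_t     bi_sector; /* device address in 512-byte sectors */
                unsigned int bi_size;   /* residual I/O count */
                unsigned int bi_idx;    /* current index into bi_io_vec */
        };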
     

21 Nov, 2013

1 commit

  • Pull md update from Neil Brown:
    "Mostly optimisations and obscure bug fixes.
    - raid5 gets less lock contention
    - raid1 gets less contention between normal-io and resync-io during
    resync"

    * tag 'md/3.13' of git://neil.brown.name/md:
    md/raid5: Use conf->device_lock protect changing of multi-thread resources.
    md/raid5: Before freeing old multi-thread worker, it should flush them.
    md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE.
    UAPI: include <asm/byteorder.h> in linux/raid/md_p.h
    raid1: Rewrite the implementation of iobarrier.
    raid1: Add some macros to make code clearly.
    raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array.
    raid1: Add a field array_frozen to indicate whether raid in freeze state.
    md: Convert use of typedef ctl_table to struct ctl_table
    md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes.
    md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.
    md: fix some places where mddev_lock return value is not checked.
    raid5: Retry R5_ReadNoMerge flag when hit a read error.
    raid5: relieve lock contention in get_active_stripe()
    raid5: relieve lock contention in get_active_stripe()
    wait: add wait_event_cmd()
    md/raid5.c: add proper locking to error path of raid5_start_reshape.
    md: fix calculation of stacking limits on level change.
    raid5: Use slow_path to release stripe when mddev->thread is null

    Linus Torvalds
     

19 Nov, 2013

4 commits

  • There is an iobarrier in raid1 because of contention between normal IO and
    resync IO. It suspends all normal IO when resync/recovery happens.

    However if normal IO is outside the resync window, there is no
    contention. So this patch changes the barrier mechanism to only block
    IO that could contend with the resync that is currently happening.

    We partition the whole space into five parts.

    |---------|-----------|------------|----------------|-------|
              ^           ^            ^                ^
            start    next_resync start_next_window  end_window

    start + RESYNC_WINDOW = next_resync
    next_resync + NEXT_NORMALIO_DISTANCE = start_next_window
    start_next_window + NEXT_NORMALIO_DISTANCE = end_window

    Firstly we introduce some concepts:

    1 - RESYNC_WINDOW: For resync, there are 32 resync requests at most at the
    same time. A sync request is RESYNC_BLOCK_SIZE(64*1024).
    So the RESYNC_WINDOW is 32 * RESYNC_BLOCK_SIZE, that is 2MB.
    2 - NEXT_NORMALIO_DISTANCE: the distance between next_resync
    and start_next_window. It also indicates the distance between
    start_next_window and end_window.
    It is currently 3 * RESYNC_WINDOW_SIZE but could be tuned if
    this turned out not to be optimal.
    3 - next_resync: the next sector at which we will do sync IO.
    4 - start: a position which is at most RESYNC_WINDOW before
    next_resync.
    5 - start_next_window: a position which is NEXT_NORMALIO_DISTANCE
    beyond next_resync. Normal-io after this position doesn't need to
    wait for resync-io to complete.
    6 - end_window: a position which is 2 * NEXT_NORMALIO_DISTANCE beyond
    next_resync. This also doesn't need to wait, but is counted
    differently.
    7 - current_window_requests: the count of normalIO between
    start_next_window and end_window.
    8 - next_window_requests: the count of normalIO after end_window.

    NormalIO will be partitioned into four types:

    NormIO1: the end sector of the bio is smaller than or equal to start.
    NormIO2: the start sector of the bio is larger than or equal to
    end_window.
    NormIO3: the start sector of the bio is larger than or equal to
    start_next_window.
    NormIO4: the bio lies between start and start_next_window.

    |---------|-----------|------------|----------------|-------|
              ^           ^            ^                ^
            start    next_resync start_next_window  end_window
      NormIO1    NormIO4     NormIO4        NormIO3    NormIO2

    For NormIO1, we don't need any io barrier.
    For NormIO4, we use a similar approach to the original iobarrier
    mechanism: the normalIO and resyncIO must be kept separate.
    For NormIO2/3, we add two fields to struct r1conf: "current_window_requests"
    and "next_window_requests". They indicate the count of active
    requests in the two windows.
    For these, we don't wait for resync io to complete.

    For the resync action: if there are NormIO4s, we must wait for them
    to complete; if not, we can proceed.
    But if the resync action reaches start_next_window and
    current_window_requests > 0 (that is, there are NormIO3s), we must
    wait until current_window_requests becomes zero.
    When current_window_requests becomes zero, start_next_window also
    moves forward, and current_window_requests is replaced by
    next_window_requests.

    There is the problem of when and how a request changes from NormIO2
    to NormIO3; only then can the sync action progress.

    We add a field "start_next_window" to struct r1conf:

    A: if start_next_window == MaxSector, it means there are no NormIO2/3,
    so start_next_window = next_resync + NEXT_NORMALIO_DISTANCE.
    B: if current_window_requests == 0 && next_window_requests != 0, it
    means start_next_window moves to end_window.

    There is another problem: how to differentiate between old NormIO2
    (which has now become NormIO3) and new NormIO2.
    For example, suppose there are many bios which are NormIO2 and one
    bio which is NormIO3. The NormIO3 completes first, so the NormIO2
    bios become NormIO3.

    We add a field in struct r1bio "start_next_window".
    This is used to record the position conf->start_next_window when the call
    to wait_barrier() is made in make_request().

    In allow_barrier(), we check conf->start_next_window.
    If r1bio->start_next_window == conf->start_next_window, it means
    there was no transition between NormIO2 and NormIO3.
    If r1bio->start_next_window != conf->start_next_window, it means
    there was a transition between NormIO2 and NormIO3. There can only
    have been one transition, so the bio must be an old NormIO2.

    For one bio, there may be many r1bios. So we make sure
    all the r1bio->start_next_window values are the same.
    If we encounter a blocked device in make_request(), we must call
    allow_barrier() and later wait_barrier() again, so the value of
    conf->start_next_window can change across that call.
    If the r1bios for one bio had different start_next_window values,
    the accounting for that bio would depend on the last r1bio's value
    and would go wrong. To avoid this, we must wait for previous r1bios
    to complete. (A toy classifier for the four normal-IO types follows
    this entry.)

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
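
    A toy classifier for the four normal-IO types described above, given
    the three boundaries. This is a runnable userspace model of the
    description, not the kernel code itself; all names are invented:

        #include <stdio.h>

        typedef unsigned long long sector_t;

        static int classify(sector_t bio_start, sector_t bio_end,
                            sector_t start, sector_t start_next_window,
                            sector_t end_window)
        {
                if (bio_end <= start)
                        return 1;  /* NormIO1: no barrier needed */
                if (bio_start >= end_window)
                        return 2;  /* NormIO2: counted in next_window_requests */
                if (bio_start >= start_next_window)
                        return 3;  /* NormIO3: counted in current_window_requests */
                return 4;          /* NormIO4: must wait for resync, as before */
        }

        int main(void)
        {
                /* start=0, start_next_window=100, end_window=200 (arbitrary) */
                printf("%d %d %d %d\n",
                       classify(210, 220, 0, 100, 200),   /* 2 */
                       classify(150, 160, 0, 100, 200),   /* 3 */
                       classify(50, 60, 0, 100, 200),     /* 4 */
                       classify(0, 0, 0, 100, 200));      /* 1 */
                return 0;
        }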
     
  • In a subsequent patch, we'll use some const parameters.
    Using macros will make the code clearer.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     
  • … reconfiguring the array.

    We used to use raise_barrier to suspend normal IO while we reconfigured
    the array. However raise_barrier will soon only suspend some normal
    IO, not all. So we need something else:
    change it to use freeze_array.
    freeze_array not only suspends normal io, it also suspends resync io,
    but for the places that call raise_barrier to reconfigure the array,
    that isn't a problem.

    Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    majianpeng
     
  • The following patch will rewrite the interaction between normal IO
    and resync IO, so we add a field to indicate whether the raid array
    is in the frozen state.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: NeilBrown

    majianpeng
     

09 Nov, 2013

1 commit


24 Oct, 2013

1 commit

  • Since:
    commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86
    md: Allow devices to be re-added to a read-only array.

    spares are activated on a read-only array. For the raid1 and raid10
    personalities this means that not-in-sync devices are marked in-sync
    without checking whether recovery has finished.

    If a read-only array is degraded and one of its devices is not in-sync
    (because the array has only been partially recovered), recovery will
    be skipped.

    This patch adds a check that recovery has finished before marking a
    device in-sync in the raid1 and raid10 personalities. The raid5
    personality already has such a condition (at raid5.c:6029). (A sketch
    of the added condition follows this entry.)

    Bug was introduced in 3.10 and causes data corruption.

    Cc: stable@vger.kernel.org
    Signed-off-by: Pawel Baldysiak
    Signed-off-by: Lukasz Dorau
    Signed-off-by: NeilBrown

    Lukasz Dorau
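
    A hedged sketch of the added condition (the surrounding spare_active()
    loop is assumed, not quoted): only mark a device In_sync once its
    recovery_offset shows recovery actually completed.

        if (rdev &&
            rdev->recovery_offset == MaxSector &&  /* recovery finished */
            !test_bit(Faulty, &rdev->flags) &&
            !test_and_set_bit(In_sync, &rdev->flags))
                count++;  /* device may now be counted as in-sync */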
     

18 Jul, 2013

1 commit

  • The recent change to use bio_copy_data() in raid1 when repairing
    an array is faulty.

    The underlying device may have changed the bio in various ways using
    bio_advance(), and these changes need to be undone not just for the
    'sbio' which is being copied to, but also for the 'pbio' (primary)
    which is being copied from.

    So perform the reset on all bios that were read from, and do it early.

    This also ensures that the sbio->bi_io_vec[j].bv_len passed to
    memcmp is correct.

    This fixes a crash during a 'check' of a RAID1 array. The crash was
    introduced in 3.10 so this is suitable for 3.10-stable.

    Cc: stable@vger.kernel.org (3.10)
    Reported-by: Joe Lawrence
    Signed-off-by: NeilBrown

    NeilBrown
     

14 Jun, 2013

2 commits

  • DM RAID: Add ability to restore transiently failed devices on resume

    This patch adds code to the resume function to check over the devices
    in the RAID array. If any are found to be marked as failed and their
    superblocks can be read, an attempt is made to reintegrate them into
    the array. This allows the user to refresh the array with a simple
    suspend and resume of the array - rather than having to load a
    completely new table, allocate and initialize all the structures and
    throw away the old instantiation.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    Jonathan Brassow
     
  • Pull md bugfixes from Neil Brown:
    "A few bugfixes for md

    Some tagged for -stable"

    * tag 'md-3.10-fixes' of git://neil.brown.name/md:
    md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
    md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
    md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
    md: md_stop_writes() should always freeze recovery.

    Linus Torvalds
     

13 Jun, 2013

3 commits

  • There are cases where the kernel will believe that the WRITE SAME
    command is supported by a block device which does not, in fact,
    support WRITE SAME. This currently happens for SATA drives behind a
    SAS controller, but there are probably a hundred other ways that can
    happen, including drive firmware bugs.

    After receiving an error for WRITE SAME the block layer will retry the
    request as a plain write of zeroes, but mdraid will treat the
    failure as fatal and mark the drive failed. This has the effect
    that all the mirrors containing a specific set of data are each
    offlined in very rapid succession resulting in data loss.

    However, just bouncing the request back up to the block layer isn't
    ideal either, because the whole initial request-retry sequence should
    be inside the write bitmap fence, which probably means that md needs
    to do its own conversion of WRITE SAME to write zero.

    Until the failure scenario has been sorted out, disable WRITE SAME for
    raid1, raid5, and raid10.

    [neilb: added raid5]

    This patch is appropriate for any -stable since 3.7 when write_same
    support was added.

    Cc: stable@vger.kernel.org
    Signed-off-by: H. Peter Anvin
    Signed-off-by: NeilBrown

    H. Peter Anvin
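
    The disabling itself can be a single call per personality at setup
    time. A minimal sketch (blk_queue_max_write_same_sectors() is the
    existing block-layer helper; the call site is assumed):

        /* Advertise zero WRITE SAME capacity so the block layer never
         * sends the command down to this md personality. */
        if (mddev->queue)
                blk_queue_max_write_same_sectors(mddev->queue, 0);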
     
  • Various places in raid1 and raid10 are calling raise_barrier when they
    really should call freeze_array.
    The former is only intended to be called from "make_request".
    The latter has extra checks for 'nr_queued' and makes a call to
    flush_pending_writes(), so it is safe to call it from within the
    management thread.

    Using raise_barrier will sometimes deadlock. Using freeze_array
    should not.

    As 'freeze_array' currently expects one request to be pending (in
    handle_read_error - the only previous caller), we need to pass
    it the number of pending requests (extra) to ignore.

    The deadlock was made particularly noticeable by commits
    050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
    appeared in 3.4, so the fix is appropriate for any -stable
    kernel since then.

    This patch probably won't apply directly to some early kernels and
    will need to be applied by hand.

    Cc: stable@vger.kernel.org
    Reported-by: Alexander Lyakas
    Signed-off-by: NeilBrown

    NeilBrown
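
    Roughly the reworked raid1 helper (a sketch assuming the 3.10-era
    field names; the raid10 version is analogous): wait until only
    'extra' requests remain pending, flushing queued writes while
    waiting so the wait itself cannot deadlock.

        static void freeze_array(struct r1conf *conf, int extra)
        {
                spin_lock_irq(&conf->resync_lock);
                conf->barrier++;
                conf->nr_waiting++;
                wait_event_lock_irq_cmd(conf->wait_barrier,
                                        conf->nr_pending == conf->nr_queued + extra,
                                        conf->resync_lock,
                                        flush_pending_writes(conf));
                spin_unlock_irq(&conf->resync_lock);
        }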
     
  • …ebuilding drive completed it.

    Without that fix, the following scenario could happen:

    - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
    - Drive A fails
    - WRITE request arrives to the array. It is failed by drive A, so
    r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
    succeeds in writing it, so the same r1_bio is marked as
    R1BIO_Uptodate.
    - r1_bio arrives to handle_write_finished, badblocks are disabled,
    md_error()->error() does nothing because we don't fail the last drive
    of raid1
    - raid_end_bio_io() calls call_bio_endio()
    - As a result, in call_bio_endio():
    if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
    clear_bit(BIO_UPTODATE, &bio->bi_flags);
    this code doesn't clear the BIO_UPTODATE flag, and the whole master
    WRITE succeeds, back to the upper layer.

    So we returned success to the upper layer, even though we had written
    the data onto the rebuilding drive only. But when we want to read the
    data back, we would not read from the rebuilding drive, so this data
    is lost.

    [neilb - applied identical change to raid10 as well]

    This bug can result in lost data, so it is suitable for any
    -stable kernel.

    Cc: stable@vger.kernel.org
    Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
    Signed-off-by: NeilBrown <neilb@suse.de>

    Alex Lyakas
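
    A hedged sketch of the check in the write-completion path (the flag
    names are real md/raid1 identifiers; the exact placement is assumed):

        /* Only a fully in-sync, healthy mirror's success may mark the
         * master bio uptodate; a rebuilding drive's success must not. */
        if (test_bit(In_sync, &rdev->flags) &&
            !test_bit(Faulty, &rdev->flags))
                set_bit(R1BIO_Uptodate, &r1_bio->state);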
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

30 Apr, 2013

2 commits


24 Mar, 2013

9 commits

  • More utility code to replace stuff that's getting open coded.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • More prep work for immutable bvecs:

    A few places in the code were either open coding or using the wrong
    version - fix.

    After we introduce the bvec iter, it'll no longer be possible to modify
    the biovec through bio_for_each_segment_all() - it doesn't increment a
    pointer to the current bvec, you pass in a struct bio_vec (not a
    pointer) which is updated with what the current biovec would be (taking
    into account bi_bvec_done and bi_size).

    So because of that it's more worthwhile to be consistent about
    bio_for_each_segment()/bio_for_each_segment_all() usage.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Alexander Viro

    Kent Overstreet
     
  • __bio_for_each_segment() iterates bvecs from the specified index
    instead of bio->bv_idx. Currently, the only usage is to walk all the
    bvecs after the bio has been advanced by specifying 0 index.

    For immutable bvecs, we need to split these apart;
    bio_for_each_segment() is going to have a different implementation.
    This will also help document the intent of code that's using it -
    bio_for_each_segment_all() is only legal to use for code that owns the
    bio.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Neil Brown
    CC: Boaz Harrosh

    Kent Overstreet
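
    A usage sketch of the distinction (the era-appropriate signature of
    bio_for_each_segment_all() is assumed; the zeroing body is just an
    example):

        struct bio_vec *bvec;
        int i;

        /* Owner-only iteration over every segment, ignoring the
         * iterator state; bio_for_each_segment() would respect it. */
        bio_for_each_segment_all(bvec, bio, i)
                memset(page_address(bvec->bv_page) + bvec->bv_offset,
                       0, bvec->bv_len);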
     
  • This doesn't really delete any code _yet_, but once immutable bvecs are
    done we can just delete the rest of the code in that loop.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • More bi_idx removal. This code was just open coding bio_clone(). This
    could probably be further improved by using bio_advance() instead of
    skipping over null pages, but that'd be a larger rework.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown

    Kent Overstreet
     
  • Random cleanup - this code was duplicated and it's not really specific
    to md.

    Also added the ability to return the actual error code.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    Acked-by: Tejun Heo

    Kent Overstreet
     
  • Bunch of places in the code weren't using it where they could be -
    this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx
    into a struct bvec_iter.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: "Ed L. Cashin"
    CC: Nick Piggin
    CC: Jiri Kosina
    CC: Jim Paris
    CC: Geoff Levand
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Steven Rostedt
    Acked-by: Ed Cashin

    Kent Overstreet
     
  • Just a little convenience macro - main reason to add it now is preparing
    for immutable bio vecs, it'll reduce the size of the patch that puts
    bi_sector/bi_size/bi_idx into a struct bvec_iter.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Lars Ellenberg
    CC: Jiri Kosina
    CC: Alasdair Kergon
    CC: dm-devel@redhat.com
    CC: Neil Brown
    CC: Martin Schwidefsky
    CC: Heiko Carstens
    CC: linux-s390@vger.kernel.org
    CC: Chris Mason
    CC: Steven Whitehouse
    Acked-by: Steven Whitehouse

    Kent Overstreet
     

26 Feb, 2013

2 commits

  • When raid1/raid10 needs to fix a read error, it first drains
    all pending requests by calling freeze_array().
    This calls flush_pending_writes() if it needs to sleep,
    but some writes may be pending in a per-process plug rather
    than in the per-array request queue.

    When raid1{,0}_unplug() moves the request from the per-process
    plug to the per-array request queue (from which
    flush_pending_writes() can flush them), it needs to wake up
    freeze_array(), or freeze_array() will never flush them and so
    it will block forever.

    So add the required wake_up() calls.

    This bug was introduced by commit
    f54a9d0e59c4bea3db733921ca9147612a6f292c
    for raid1 and a similar commit for RAID10, and so has been present
    since linux-3.6. As the bug causes a deadlock I believe this fix is
    suitable for -stable.

    Cc: stable@vger.kernel.org (3.6.y 3.7.y 3.8.y)
    Reported-by: Tregaron Bayly
    Tested-by: Tregaron Bayly
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Set mddev queue's max_write_same_sectors to its chunk_sector value (before
    disk_stack_limits merges the underlying disk limits). With that in place,
    be sure to handle writes coming down from the block layer that have the
    REQ_WRITE_SAME flag set. That flag needs to be copied into any newly cloned
    write bio.

    Signed-off-by: Joe Lawrence
    Acked-by: "Martin K. Petersen"
    Signed-off-by: NeilBrown

    Joe Lawrence
     

18 Dec, 2012

1 commit

  • Pull block driver update from Jens Axboe:
    "Now that the core bits are in, here are the driver bits for 3.8. The
    branch contains:

    - A huge pile of drbd bits that were dumped from the 3.7 merge
    window. Following that, it was both made perfectly clear that
    there is going to be no more over-the-wall pulls and how the
    situation on individual pulls can be improved.

    - A few cleanups from Akinobu Mita for drbd and cciss.

    - Queue improvement for loop from Lukas. This grew into adding a
    generic interface for waiting/checking an event with a specific
    lock, allowing this to be pulled out of md; now loop and drbd are
    also using it.

    - A few fixes for xen back/front block driver from Roger Pau Monne.

    - Partition improvements from Stephen Warren, allowing partition UUID
    to be used as an identifier."

    * 'for-3.8/drivers' of git://git.kernel.dk/linux-block: (609 commits)
    drbd: update Kconfig to match current dependencies
    drbd: Fix drbdsetup wait-connect, wait-sync etc... commands
    drbd: close race between drbd_set_role and drbd_connect
    drbd: respect no-md-barriers setting also when changed online via disk-options
    drbd: Remove obsolete check
    drbd: fixup after wait_event_lock_irq() addition to generic code
    loop: Limit the number of requests in the bio list
    wait: add wait_event_lock_irq() interface
    xen-blkfront: free allocated page
    xen-blkback: move free persistent grants code
    block: partition: msdos: provide UUIDs for partitions
    init: reduce PARTUUID min length to 1 from 36
    block: store partition_meta_info.uuid as a string
    cciss: use check_signature()
    cciss: cleanup bitops usage
    drbd: use copy_highpage
    drbd: if the replication link breaks during handshake, keep retrying
    drbd: check return of kmalloc in receive_uuids
    drbd: Broadcast sync progress no more often than once per second
    drbd: don't try to clear bits once the disk has failed
    ...

    Linus Torvalds
     

30 Nov, 2012

1 commit

  • New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
    moves the private wait_event_lock_irq() macro from MD to the regular
    wait includes, introduces the new macro wait_event_lock_irq_cmd()
    (instead of the old method of omitting the cmd parameter, which was
    ugly), and makes use of the new macros in MD. It also introduces the
    _interruptible_ variant.

    The new interface is for use when one has a special lock to protect
    data structures used in the condition, or also needs to invoke "cmd"
    before putting the task to sleep. (A usage sketch follows this entry.)

    All new macros are expected to be called with the lock taken. The lock
    is released before sleep and is reacquired afterwards. We will leave the
    macro with the lock held.

    Note to DM: IMO this should also fix a theoretical race on the
    waitqueue when using wait_event_lock_irq() and wait_event()
    simultaneously, due to the lack of locking around the current state
    setting and wait queue removal.

    Signed-off-by: Lukas Czerner
    Cc: Neil Brown
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Jens Axboe

    Lukas Czerner
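
    A usage sketch (the conf->... names are placeholders, not quoted
    code): the macro is entered with the lock held, drops it around the
    sleep and each condition check, and returns with it held.

        spin_lock_irq(&conf->device_lock);
        wait_event_lock_irq_cmd(conf->wait_barrier,
                                conf->nr_pending == 0,        /* condition */
                                conf->device_lock,            /* protecting lock */
                                flush_pending_writes(conf));  /* cmd, run before sleeping */
        /* conf->device_lock is held again here */
        spin_unlock_irq(&conf->device_lock);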
     

27 Nov, 2012

1 commit

  • If the raid1 or raid10 unplug function gets called
    from a make_request function (which is very possible) when
    there are bios on the current->bio_list list, then it will not
    be able to successfully call bitmap_unplug(): it could
    need to submit more bios and wait for them to complete,
    but they won't complete while current->bio_list is non-empty.

    So detect that case and hand the unplugging off to another thread,
    just like we already do when called from within the scheduler.
    (A sketch of that dispatch follows this entry.)

    RAID1 version of bug was introduced in 3.6, so that part of fix is
    suitable for 3.6.y. RAID10 part won't apply.

    Cc: stable@vger.kernel.org
    Reported-by: Torsten Kaiser
    Reported-by: Peter Maloney
    Signed-off-by: NeilBrown

    NeilBrown
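
    A hedged sketch of the dispatch decision in raid1_unplug() (shape
    assumed from the description above):

        if (from_schedule || current->bio_list) {
                /* Can't submit here: bios parked on current->bio_list
                 * would never complete. Queue for the md thread instead. */
                spin_lock_irq(&conf->device_lock);
                bio_list_merge(&conf->pending_bio_list, &plug->pending);
                spin_unlock_irq(&conf->device_lock);
                md_wakeup_thread(mddev->thread);
                kfree(plug);
                return;
        }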
     

31 Oct, 2012

1 commit

  • setup_conf in raid1.c uses conf->raid_disks before assigning
    a value. It is used when including 'Replacement' devices.

    The consequence is that assembling an array which contains a
    replacement will misbehave and either not include the replacement, or
    not include the device being replaced.

    Though this doesn't lead directly to data corruption, it could lead to
    reduced data safety.

    So use mddev->raid_disks, which is initialised, instead.

    Bug was introduced by commit c19d57980b38a5bb613a898937a1cf85f422fb9b
    md/raid1: recognise replacements when assembling arrays.

    in 3.3, so fix is suitable for 3.3.y thru 3.6.y.

    Cc: stable@vger.kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

11 Oct, 2012

3 commits