25 May, 2008

1 commit

  • When we get any IO error during a recovery (rebuilding a spare), we abort
    the recovery and restart it.

    For RAID6 (and multi-drive RAID1) it may not be best to restart at the
    beginning: when multiple failures can be tolerated, the recovery may be
    able to continue and re-doing all that has already been done doesn't make
    sense.

    We already have the infrastructure to record where a recovery is up to
    and restart from there, but it is not being used properly.
    This is because:
    - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not to be checkpointed.
    - We remove spares and then re-add them, which loses important state
    information.

    The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
    needed. If there is an error, the relevant drive will be marked as
    Faulty, and that is enough to ensure correct handling of the error. So we
    first remove MD_RECOVERY_ERR, changing some of the uses of it to
    MD_RECOVERY_INTR.

    Then we cause the attempt to remove a non-faulty device from an array to
    fail (unless recovery is impossible as the array is too degraded). Then
    when remove_and_add_spares attempts to remove the devices on which
    recovery can continue, it will fail, they will remain in place, and
    recovery will continue on them as desired.

    Issue: If we are halfway through rebuilding a spare and another drive
    fails, and a new spare is immediately available, do we want to:
    1/ complete the current rebuild, then go back and rebuild the new spare or
    2/ restart the rebuild from the start and rebuild both devices in
    parallel.

    Both options can be argued for. The code currently takes option 2 as
    a/ this requires the least code change
    b/ it results in a minimally-degraded array in minimal time.
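    A minimal userspace sketch of the resulting policy (the struct, field, and
    function names below are illustrative stand-ins, not the kernel's): an IO
    error no longer raises a separate error flag; the drive is marked Faulty
    elsewhere and recovery is merely interrupted, so the recorded offset
    survives and a restart resumes from the checkpoint instead of sector 0.

```c
#include <assert.h>

/* Hypothetical flag bit, mirroring MD_RECOVERY_INTR; MD_RECOVERY_ERR is gone. */
#define RECOVERY_INTR (1 << 0)

struct recovery_state {
    unsigned long flags;
    long long recovery_offset;  /* sectors already rebuilt: the checkpoint */
};

/* On an IO error we only interrupt; the checkpoint is kept. */
static void handle_io_error(struct recovery_state *rs, long long done)
{
    rs->recovery_offset = done;
    rs->flags |= RECOVERY_INTR;
}

/* A restart resumes from the checkpoint instead of sector 0. */
static long long restart_recovery(struct recovery_state *rs)
{
    rs->flags &= ~RECOVERY_INTR;
    return rs->recovery_offset;
}
```

    With the old MD_RECOVERY_ERR behaviour the offset would have been discarded
    and the return value above would be 0 again.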

    Cc: "Eivind Sarto"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 May, 2008

1 commit

  • As setting and clearing queue flags now requires that we hold a spinlock
    on the queue, and as blk_queue_stack_limits is called without that lock,
    get the lock inside blk_queue_stack_limits.

    For blk_queue_stack_limits to be able to find the right lock, each md
    personality needs to set q->queue_lock to point to the appropriate lock.
    Those personalities which didn't previously use a spin_lock use
    q->__queue_lock. So always initialise that lock when the queue is allocated.

    With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
    longer cause warnings as it will be clear that the proper lock is held.
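    A userspace toy of the change (the names are stand-ins, not the block
    layer's exact fields): the queue always carries an initialised default lock
    in __queue_lock, queue_lock points at whichever lock the personality uses,
    and the stacking helper now takes that lock itself rather than requiring
    the caller to hold it.

```c
#include <assert.h>

struct queue {
    int __queue_lock;      /* default lock, always initialised; 0 = unlocked */
    int *queue_lock;       /* points at the lock this queue actually uses */
    unsigned int max_sectors;
};

static void queue_init(struct queue *q, unsigned int max_sectors)
{
    q->__queue_lock = 0;
    q->queue_lock = &q->__queue_lock;   /* default when no private lock */
    q->max_sectors = max_sectors;
}

/* Takes the queue's own lock internally, so callers need not hold it. */
static void queue_stack_limits(struct queue *top, struct queue *bottom)
{
    assert(*top->queue_lock == 0);      /* caller must NOT already hold it */
    *top->queue_lock = 1;               /* lock */
    if (bottom->max_sectors < top->max_sectors)
        top->max_sectors = bottom->max_sectors;
    *top->queue_lock = 0;               /* unlock */
}
```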

    Thanks to Dan Williams for review and fixing the silly bugs.

    Signed-off-by: NeilBrown
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Alistair John Strachan
    Cc: Nick Piggin
    Cc: "Rafael J. Wysocki"
    Cc: Jacek Luczak
    Cc: Prakash Punnoor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

09 May, 2008

1 commit

  • drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
    drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
    sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer
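    The pattern behind all three warnings, in a self-contained example (the
    function below is made up for illustration): returning the literal 0 from a
    pointer-returning function is what draws sparse's "Using plain integer as
    NULL pointer" complaint, and the fix is simply to return NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Before the fix this read `return 0;`, which sparse flags because the
 * function returns a pointer, not an integer. */
static const char *find_name(int id)
{
    if (id != 42)
        return NULL;   /* was: return 0; */
    return "md0";
}
```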

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

30 Apr, 2008

1 commit

  • Allows a userspace metadata handler to take action upon detecting a device
    failure.

    Based on an original patch by Neil Brown.

    Changes:
    -added blocked_wait waitqueue to rdev
    -don't qualify Blocked with Faulty; always let userspace block writes
    -added md_wait_for_blocked_rdev to wait for the blocked device to be clear; if
    userspace misses the notification, another one is sent every 5 seconds
    -set MD_RECOVERY_NEEDED after clearing "blocked"
    -kill DoBlock flag, just test mddev->external

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

28 Apr, 2008

1 commit

    MD drivers use one printk() call to print 2 log messages, and the second line
    may be prefixed by a TAB character. It may also output a trailing space
    before the newline. klogd (I think) turns the TAB character into the two
    characters '^I' when logging to a file. This looks ugly.

    Instead of a leading TAB to indicate continuation, prefix both output lines
    with 'raid:' or similar. Also remove any trailing space in the vicinity of
    the affected code and consistently end the sentences with a period.
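    A userspace sketch of the before/after, using printf in place of printk
    (the message text is illustrative): instead of one call emitting a
    TAB-continued second line with a trailing space, each line gets its own
    prefix and a clean ending.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Before: "raid10: not enough mirrors\n\t failed to run raid set \n"
 * (TAB continuation, trailing space).  After: both lines prefixed,
 * no trailing whitespace, sentences ended with a period. */
static int format_log(char *buf, size_t len)
{
    return snprintf(buf, len,
                    "raid10: not enough operational mirrors.\n"
                    "raid10: failed to run raid set.\n");
}
```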

    Signed-off-by: Nick Andrew
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Andrew
     

05 Mar, 2008

4 commits

    This message describes another issue with md RAID10 found by testing the
    2.6.24 md RAID10 using the new scsi fault injection framework.

    Abstract:

    When a scsi error results in disabling a disk during RAID10 recovery, the
    resync threads of md RAID10 could stall.

    In this case the raid array has already been broken, so it may not matter.
    But a stall is not desirable: if it occurs, even shutdown or reboot will
    fail because of resource busy.

    The deadlock mechanism:

    The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
    be handled when recovering. The "remaining" counter is incremented when
    building a BIO in sync_request() and is decremented when a BIO is finished
    in end_sync_write().

    If building a BIO fails for some reason in sync_request(), the "remaining"
    counter should be decremented if it has already been incremented. I found a
    case where this decrement is forgotten. This causes a md_do_sync() deadlock
    because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
    end_sync_write() never calls md_done_sync() because of the "remaining" counter
    mismatch.

    For example, this problem would be reproduced in the following case:

    Personalities : [raid10]
    md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
    3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
    [>....................] recovery = 2.2% (45376/1959808) finish=0.7min speed=45376K/sec

    In this case, sdf1 is recovering while sdb1 and sde1 are disabled.
    A further error that detaches sdd will cause a deadlock.

    md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
    3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
    [=>...................] recovery = 5.0% (99520/1959808) finish=5.9min speed=5237K/sec

    2739 ? S< 0:17 [md0_raid10]
    28608 ? D< 0:00 [md0_resync]
    28629 pts/1 Ss 0:00 bash
    28830 pts/1 R+ 0:00 ps ax
    31819 ? D< 0:00 [kjournald]

    The resync thread keeps working, but actually it is deadlocked.

    Patch:
    With this patch, the remaining counter is decremented whenever needed.
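    A toy model of the counter imbalance (struct and names simplified from the
    real r10bio_s): each BIO built bumps "remaining", each completion drops it,
    and completion of the sync depends on it reaching zero. If a failed build
    does not undo its increment, the final decrement never happens and
    md_do_sync() waits forever.

```c
#include <assert.h>

struct r10bio { int remaining; int done; };

/* Stands in for end_sync_write(): the last decrement signals completion
 * (the real code calls md_done_sync() here). */
static void end_sync_write(struct r10bio *r10)
{
    if (--r10->remaining == 0)
        r10->done = 1;
}

/* Stands in for the BIO-building loop in sync_request().  The fix: when
 * building BIO `fail_disk` fails after the increment, the increment is
 * undone instead of being forgotten. */
static void build_bios(struct r10bio *r10, int ndisks, int fail_disk)
{
    for (int d = 0; d < ndisks; d++) {
        r10->remaining++;
        if (d == fail_disk)
            r10->remaining--;   /* the previously forgotten decrement */
    }
}
```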

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    K.Tanaka
     
    Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
    another possible deadlock in raid1/raid10 error handling.

    If a read request returns an error while a resync is happening and a resync
    request is pending, the attempt to fix the error will block until the resync
    progresses, and the resync will block until the read request completes. Thus
    a deadlock.

    This patch fixes the problem.

    Cc: "K.Tanaka"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This patch changes the disk to be read for layout "far > 1" to always be the
    disk with the lowest block address.

    Thus the chunks to be read will always be (for a fully functioning array) from
    the first band of stripes, and the raid will then work as a raid0 consisting
    of the first band of stripes.

    Some advantages:

    The fastest part of each disk, the outer sectors, will be used. The outer
    blocks of a disk may be as much as 100% faster than the inner blocks.

    Average seek time will be smaller, as seeks will always be confined to the
    first part of the disks.

    Mixed disks with different performance characteristics will work better, as
    they will work as raid0; the sequential read rate will be the number of disks
    involved times the IO rate of the slowest disk.

    If a disk is malfunctioning, the first working disk that has the lowest
    block address for the logical block will be used.
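    A toy of the read-balance rule for 'far' with 2 copies (the mapping below
    is a simplification, not the driver's exact geometry code): copy 0 of each
    chunk sits in the first band at low device addresses, copy 1 on the next
    disk in a far band, and the reader prefers the working copy with the lowest
    device block address.

```c
#include <assert.h>

struct mirror { int in_sync; };

/* Pick the disk to read chunk `chunk` from, returning the disk index and
 * the device-relative address of the chosen copy.  BAND_OFFSET is an
 * arbitrary stand-in for the start of the far band. */
#define BAND_OFFSET 1000

static int read_balance_far(struct mirror *m, int ndisks,
                            int chunk, long long *dev_addr)
{
    int d0 = chunk % ndisks;            /* copy 0: first band, outer sectors */
    if (m[d0].in_sync) {
        *dev_addr = chunk / ndisks;     /* lowest block address wins */
        return d0;
    }
    int d1 = (d0 + 1) % ndisks;         /* copy 1: next disk, far band */
    *dev_addr = chunk / ndisks + BAND_OFFSET;
    return d1;
}
```

    On a fully functioning array this always reads the first band, so the
    array behaves like a raid0 over the outer stripes.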

    Signed-off-by: Keld Simonsen
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keld Simonsen
     
  • When handling a read error, we freeze the array to stop any other IO while
    attempting to over-write with correct data.

    This is done in the raid1d (raid10d) thread and must wait for all submitted
    IO to complete (except for requests that failed and are sitting in the retry
    queue - these are counted in ->nr_queued and will stay there during a
    freeze).

    However write requests need attention from raid1d as bitmap updates might be
    required. This can cause a deadlock as raid1 is waiting for requests to
    finish that themselves need attention from raid1d.

    So we create a new function 'flush_pending_writes' to give that attention, and
    call it in freeze_array to be sure that we aren't waiting on raid1d.
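    A toy model of the fix (fields and the spin count are simplifications of
    the real conf_t accounting): freezing waits for in-flight IO to drain down
    to the queued retries, but queued writes need raid1d itself, so the wait
    loop now flushes them instead of spinning forever.

```c
#include <assert.h>

struct conf { int nr_pending; int nr_queued; int pending_writes; };

/* Gives the queued writes the attention (e.g. bitmap updates) they were
 * waiting on raid1d for, letting them complete. */
static void flush_pending_writes(struct conf *c)
{
    c->nr_pending -= c->pending_writes;
    c->pending_writes = 0;
}

/* Returns 1 once frozen; returns 0 if the wait never terminates within
 * max_spins iterations (i.e. the old deadlock). */
static int freeze_array(struct conf *c, int max_spins)
{
    while (c->nr_pending > c->nr_queued) {
        flush_pending_writes(c);   /* the new call that breaks the deadlock */
        if (--max_spins == 0)
            return 0;
    }
    return 1;
}
```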

    Thanks to "K.Tanaka" for finding and reporting this
    problem.

    Cc: "K.Tanaka"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

07 Feb, 2008

3 commits

    This is more in line with common practice in the kernel. Also swap the
    args around to be more like list_for_each.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This allows userspace to control resync/reshape progress and synchronise it
    with other activities, such as shared access in a SAN, or backing up critical
    sections during a tricky reshape.

    Writing a number of sectors (which must be a multiple of the chunk size if
    such is meaningful) causes a resync to pause when it gets to that point.
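    A toy of the described control (the sysfs attribute is not named here; the
    limit variable below is illustrative): the resync loop advances in
    chunk-sized steps but never past the user-written sector count, pausing
    there until userspace raises the limit.

```c
#include <assert.h>

/* One step of a resync loop bounded by a userspace-set sector limit. */
static long long resync_step(long long cur, long long chunk, long long limit)
{
    long long next = cur + chunk;
    if (next > limit)
        next = limit;   /* pause here until userspace raises the limit */
    return next;
}
```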

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
    Currently an md array with a write-intent bitmap does not update that bitmap
    to reflect successful partial resync. Rather, the entire bitmap is updated
    when the resync completes.

    This is because there is no guarantee that resync requests will complete in
    order, and tracking each request individually is unnecessarily burdensome.

    However there is value in regularly updating the bitmap, so add code to
    periodically pause while all pending sync requests complete, then update the
    bitmap. Doing this only every few seconds (the same as the bitmap update
    time) does not noticeably affect resync performance.

    [snitzer@gmail.com: export bitmap_cond_end_sync]
    Signed-off-by: Neil Brown
    Cc: "Mike Snitzer"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

09 Nov, 2007

1 commit


16 Oct, 2007

1 commit


10 Oct, 2007

1 commit

    As bi_end_io is only called once when the request is complete,
    the 'size' argument is now redundant. Remove it.

    Now there is no need for bio_endio to subtract the size completed
    from bi_size. So don't do that either.

    While we are at it, change bi_end_io to return void.
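    A self-contained sketch of the new shape (the struct is pared down to the
    relevant fields): bi_end_io now runs exactly once, returns void, and
    bio_endio no longer subtracts the completed size from bi_size.

```c
#include <assert.h>
#include <stddef.h>

struct bio;
/* New signature: no size argument, void return. The old one was roughly
 * int (*bi_end_io)(struct bio *, unsigned int bytes_done, int error). */
typedef void (bio_end_io_t)(struct bio *bio, int error);

struct bio {
    unsigned int bi_size;
    bio_end_io_t *bi_end_io;
    int completed;
};

static void my_end_io(struct bio *bio, int error)
{
    (void)error;
    bio->completed = 1;   /* called exactly once, no partial completions */
}

static void bio_endio(struct bio *bio, int error)
{
    /* no more "bio->bi_size -= bytes_done" bookkeeping */
    if (bio->bi_end_io)
        bio->bi_end_io(bio, error);
}
```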

    Signed-off-by: Neil Brown
    Signed-off-by: Jens Axboe

    NeilBrown
     

01 Aug, 2007

2 commits


24 Jul, 2007

1 commit

    Some of the code has been gradually transitioned to using the proper
    struct request_queue, but there's lots left. So do a full sweep of
    the kernel and get rid of this typedef and replace its uses with
    the proper type.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Jul, 2007

1 commit

  • bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
    to print a message if it returns non-zero, but that message is already printed
    by bitmap_file_kick.

    write_page returns an error which is not consistently checked. It always
    causes BITMAP_WRITE_ERROR to be set on an error, and that can more
    conveniently be checked.

    When the return of write_page is checked, an error causes bitmap_file_kick to
    be called - so move that call into write_page - and protect against recursive
    calls into bitmap_file_kick.

    bitmap_update_sb returns an error that is never checked.

    So make these 'void' and be consistent about checking the bit.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

17 Jun, 2007

1 commit

  • 1/ When resyncing a degraded raid10 which has more than 2 copies of each block,
    garbage can get synced on top of good data.

    2/ We round the wrong way in part of the device size calculation, which
    can cause confusion.

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

02 Mar, 2007

1 commit

    There are two errors that can lead to recovery problems with raid10
    when used in 'far' mode (not the default).

    Due to a '>' instead of '>=', the wrong block is located, which would result
    in garbage being written to some random location, quite possibly outside the
    range of the device, causing the newly reconstructed device to fail.

    The device size calculation had some rounding errors (it didn't round when it
    should) and so recovery would go a few blocks too far which would again cause
    a write to a random block address and probably a device error.

    The code for working with device sizes was fairly confused and spread out,
    so this has been tidied up a bit.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

12 Jan, 2007

1 commit

  • md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
    introducing additional latency.

    Fixing this in raid1 and raid10 seems to be straightforward enough.

    For our particular usage case in DRBD, passing this flag improved some
    initialization time from ~5 minutes to ~5 seconds.

    Acked-by: NeilBrown
    Signed-off-by: Lars Ellenberg
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lars Ellenberg
     

14 Dec, 2006

1 commit


29 Oct, 2006

1 commit


22 Oct, 2006

1 commit


03 Oct, 2006

5 commits

  • raid1, raid10 and multipath don't report their 'congested' status through
    bdi_*_congested, but should.

    This patch adds the appropriate functions which just check the 'congested'
    status of all active members (with appropriate locking).

    raid1 read_balance should be modified to prefer devices where
    bdi_read_congested returns false. Then we could use the '&' branch rather
    than the '|' branch. However that would need some benchmarking first
    to make sure it is actually a good idea.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The error handling routines don't use proper locking, and so two concurrent
    errors could trigger a problem.

    So:
    - use test-and-set and test-and-clear to synchronise the In_sync bits
    with the ->degraded count
    - use the spinlock to protect updates to the degraded count (could use
    an atomic_t, but that would be a bigger change in code and isn't really
    justified)
    - remove unnecessary locking in raid5

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • It isn't needed as mddev->degraded contains equivalent info.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Instead of magic numbers (0,1,2,3) in sb_dirty, we have
    some flags instead:
    MD_CHANGE_DEVS
    Some device state has changed requiring superblock update
    on all devices.
    MD_CHANGE_CLEAN
    The array has transitioned from 'clean' to 'dirty' or back,
    requiring a superblock update on active devices, but possibly
    not on spares
    MD_CHANGE_PENDING
    A superblock update is underway.

    We wait for an update to complete by waiting for all flags to be clear. A
    flag can be set at any time, even during an update, without risk that the
    change will be lost.

    Stop exporting md_update_sb - it isn't needed.
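    The flag scheme can be sketched with three bits and a simple "done when
    all clear" rule (the simulation below is illustrative; only the flag names
    come from the patch): a flag set mid-update keeps the state non-zero, so
    the new change is never lost.

```c
#include <assert.h>

/* The three flag bits replacing the old sb_dirty magic numbers 0..3. */
#define MD_CHANGE_DEVS    (1 << 0)  /* device state changed: update all devs */
#define MD_CHANGE_CLEAN   (1 << 1)  /* clean<->dirty: active devices only    */
#define MD_CHANGE_PENDING (1 << 2)  /* superblock update underway            */

/* An update is complete only when every flag has been cleared. */
static int update_done(unsigned long sb_flags)
{
    return sb_flags == 0;
}
```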

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
    raid10d has too many nested blocks, so take the fix_read_error functionality
    out into a separate function.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

11 Jul, 2006

1 commit


27 Jun, 2006

4 commits

    The size calculation made assumptions which the new offset mode didn't
    follow. This gets the size right in all cases.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The "industry standard" DDF format allows for a stripe/offset layout where
    data is duplicated on different stripes. e.g.

    A B C D
    D A B C
    E F G H
    H E F G

    (columns are drives, rows are stripes, LETTERS are chunks of data).

    This is similar to raid10's 'far' mode, but not quite the same. So enhance
    'far' mode with a 'far/offset' option which follows the layout of DDFs
    stripe/offset.
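    The diagram above can be written as a chunk-to-(disk, stripe) mapping.
    This sketch covers only the 2-copy case and is a simplification of the
    driver's geometry code: copy 0 of chunk k sits at its natural position,
    and copy 1 sits on the very next stripe, rotated one disk to the right.

```c
#include <assert.h>

/* Map chunk k (0-based) with `copy` in {0,1} onto n disks, matching
 *     A B C D      stripe 0  (chunks 0..3, copy 0)
 *     D A B C      stripe 1  (chunks 0..3, copy 1)
 *     E F G H      stripe 2  (chunks 4..7, copy 0)
 *     H E F G      stripe 3  (chunks 4..7, copy 1)
 */
static void offset_layout(int k, int n, int copy, int *disk, int *stripe)
{
    int band = k / n;
    if (copy == 0) {
        *disk = k % n;
        *stripe = 2 * band;
    } else {
        *disk = (k % n + 1) % n;   /* rotated one disk to the right */
        *stripe = 2 * band + 1;    /* on the immediately following stripe */
    }
}
```

    Contrast with plain 'far' mode, where the second copies live in a distant
    band of stripes rather than on the adjacent stripe.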

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • For a while we have had checkpointing of resync. The version-1 superblock
    allows recovery to be checkpointed as well, and this patch implements that.

    Due to early carelessness we need to add a feature flag to signal that the
    recovery_offset field is in use, otherwise older kernels would assume that a
    partially recovered array is in fact fully recovered.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The largest chunk size the code can support without substantial surgery is
    2^30 bytes, so make that the limit instead of an arbitrary 4Meg. Some day,
    the 'chunksize' should change to a sector-shift instead of a byte-count. Then
    no limit would be needed.
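    The arithmetic, with an illustrative conversion helper (the function name
    is hypothetical): the new ceiling is 1 << 30 bytes, and a sector-shift
    representation would store log2 of the chunk size in 512-byte sectors
    instead of a byte count, removing the limit.

```c
#include <assert.h>

#define MAX_CHUNK_BYTES (1UL << 30)   /* the new limit: 2^30 bytes */

/* Hypothetical conversion: a 64K chunk is 128 sectors, i.e. shift 7. */
static unsigned long chunk_to_sector_shift(unsigned long chunk_bytes)
{
    unsigned long sectors = chunk_bytes >> 9;   /* 512-byte sectors */
    unsigned long shift = 0;
    while ((1UL << shift) < sectors)
        shift++;
    return shift;
}
```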

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

02 May, 2006

2 commits


02 Apr, 2006

1 commit


04 Feb, 2006

1 commit

  • - version-1 superblock
    + The default_bitmap_offset is in sectors, not bytes.
    + the 'size' field in the superblock is in sectors, not KB
    - raid0_run should return a negative number on error, not '1'
    - raid10_read_balance should not return a valid 'disk' number if
    ->rdev turned out to be NULL
    - kmem_cache_destroy doesn't like being passed a NULL.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 Jan, 2006

1 commit