15 May, 2008

1 commit

  • As setting and clearing queue flags now requires that we hold a spinlock
    on the queue, and as blk_queue_stack_limits is called without that lock,
    get the lock inside blk_queue_stack_limits.

    For blk_queue_stack_limits to be able to find the right lock, each md
    personality needs to set q->queue_lock to point to the appropriate lock.
    Those personalities which didn't previously use a spin_lock, use
    q->__queue_lock. So always initialise that lock when allocated.

    With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
    longer cause warnings as it will be clear that the proper lock is held.

    Thanks to Dan Williams for review and fixing the silly bugs.

    Signed-off-by: NeilBrown
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Alistair John Strachan
    Cc: Nick Piggin
    Cc: "Rafael J. Wysocki"
    Cc: Jacek Luczak
    Cc: Prakash Punnoor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

30 Apr, 2008

1 commit

  • Allows a userspace metadata handler to take action upon detecting a device
    failure.

    Based on an original patch by Neil Brown.

    Changes:
    -added blocked_wait waitqueue to rdev
    -don't qualify Blocked with Faulty; always let userspace block writes
    -added md_wait_for_blocked_rdev to wait for the block device to be clear, if
    userspace misses the notification another one is sent every 5 seconds
    -set MD_RECOVERY_NEEDED after clearing "blocked"
    -kill DoBlock flag, just test mddev->external

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

28 Apr, 2008

1 commit

  • MD drivers use one printk() call to print 2 log messages and the second line
    may be prefixed by a TAB character. It may also output a trailing space
    before newline. klogd (I think) turns the TAB character into the 2 characters
    '^I' when logging to a file. This looks ugly.

    Instead of a leading TAB to indicate continuation, prefix both output lines
    with 'raid:' or similar. Also remove any trailing space in the vicinity of
    the affected code and consistently end the sentences with a period.

    Signed-off-by: Nick Andrew
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Andrew
     

05 Mar, 2008

2 commits

  • Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
    another possible deadlock in raid1/raid10 error handling.

    If a read request returns an error while a resync is happening and a resync
    request is pending, the attempt to fix the error will block until the resync
    progresses, and the resync will block until the read request completes. Thus
    a deadlock.

    This patch fixes the problem.

    Cc: "K.Tanaka"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When handling a read error, we freeze the array to stop any other IO while
    attempting to over-write with correct data.

    This is done in the raid1d(raid10d) thread and must wait for all submitted IO
    to complete (except for requests that failed and are sitting in the retry
    queue - these are counted in ->nr_queue and will stay there during a freeze).

    However write requests need attention from raid1d as bitmap updates might be
    required. This can cause a deadlock as raid1 is waiting for requests to
    finish that themselves need attention from raid1d.

    So we create a new function 'flush_pending_writes' to give that attention, and
    call it in freeze_array to be sure that we aren't waiting on raid1d.

    Thanks to "K.Tanaka" for finding and reporting this
    problem.

    Cc: "K.Tanaka"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

07 Feb, 2008

3 commits

  • As this is more in line with common practice in the kernel. Also swap the
    args around to be more like list_for_each.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • This allows userspace to control resync/reshape progress and synchronise it
    with other activities, such as shared access in a SAN, or backing up critical
    sections during a tricky reshape.

    Writing a number of sectors (which must be a multiple of the chunk size if
    such is meaningful) causes a resync to pause when it gets to that point.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Currently an md array with a write-intent bitmap does not update that bitmap
    to reflect successful partial resync. Rather the entire bitmap is updated
    when the resync completes.

    This is because there is no guarantee that resync requests will complete in
    order, and tracking each request individually is unnecessarily burdensome.

    However there is value in regularly updating the bitmap, so add code to
    periodically pause while all pending sync requests complete, then update the
    bitmap. Doing this only every few seconds (the same as the bitmap update
    time) does not noticeably affect resync performance.

    [snitzer@gmail.com: export bitmap_cond_end_sync]
    Signed-off-by: Neil Brown
    Cc: "Mike Snitzer"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

09 Nov, 2007

1 commit


20 Oct, 2007

1 commit

  • * Convert files to UTF-8.

    * Also correct some people's names
    (one example is Eißfeldt, which was found in a source file.
    Given that the author used an ß at all in a source file
    indicates that the real name has in fact a 'ß' and not an 'ss',
    which is commonly used as a substitute for 'ß' when limited to
    7bit.)

    * Correct town names (Goettingen -> Göttingen)

    * Update Eberhard Mönkeberg's address (http://lkml.org/lkml/2007/1/8/313)

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Adrian Bunk

    Jan Engelhardt
     

17 Oct, 2007

1 commit


16 Oct, 2007

1 commit


10 Oct, 2007

1 commit

  • As bi_end_io is only called once when the request is complete,
    the 'size' argument is now redundant. Remove it.

    Now there is no need for bio_endio to subtract the size completed
    from bi_size. So don't do that either.

    While we are at it, change bi_end_io to return void.
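
    A rough userspace sketch of the callback shape this leaves behind. The
    struct bio below is a hypothetical, stripped-down stand-in (not the kernel
    structure), and completion_record/record_end_io are invented purely for
    illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct bio: only what the sketch needs. */
struct bio;

/* Callback type after the change: no 'size' argument, returns void. */
typedef void (bio_end_io_t)(struct bio *bio, int error);

struct bio {
    unsigned int bi_size;     /* request size; not decremented on completion */
    bio_end_io_t *bi_end_io;  /* completion callback, invoked once */
    void *bi_private;         /* owner's context */
};

/* Model of bio_endio(): it simply hands the error to the callback. */
static void bio_endio(struct bio *bio, int error)
{
    assert(bio != NULL);
    if (bio->bi_end_io)
        bio->bi_end_io(bio, error);
}

/* Illustrative owner context: counts completions and keeps the last error. */
struct completion_record {
    int calls;
    int last_error;
};

/* Illustrative callback matching the new signature. */
static void record_end_io(struct bio *bio, int error)
{
    struct completion_record *rec = bio->bi_private;

    rec->calls++;
    rec->last_error = error;
}
```

    With the request completing exactly once, the callback has nothing useful
    to return and no partial size to inspect, which is why both could go.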

    Signed-off-by: Neil Brown
    Signed-off-by: Jens Axboe

    NeilBrown
     

23 Aug, 2007

2 commits

  • When a raid1 array is reshaped (number of drives changed), the list of devices
    is compacted, so that slots for missing devices are filled with working
    devices from later slots. This requires the "rd%d" symlinks in sysfs to be
    updated.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Commit 1757128438d41670ded8bc3bc735325cc07dc8f9 was slightly bad. If an array
    has a write-intent bitmap, and you remove a drive, then readd it, only the
    changed parts should be resynced. However after the above commit, this only
    works if the array has not been shut down and restarted.

    This is because it sets 'fullsync' a little more often than it should. This
    patch is more careful.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

24 Jul, 2007

1 commit

  • Some of the code has been gradually transitioned to using the proper
    struct request_queue, but there's lots left. So do a full sweep of
    the kernel and get rid of this typedef and replace its uses with
    the proper type.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Jul, 2007

1 commit

  • bitmap_unplug only ever returns 0, so it may as well be void. Two callers try
    to print a message if it returns non-zero, but that message is already printed
    by bitmap_file_kick.

    write_page returns an error which is not consistently checked. It always
    causes BITMAP_WRITE_ERROR to be set on an error, and that can more
    conveniently be checked.

    When the return of write_page is checked, an error causes bitmap_file_kick to
    be called - so move that call into write_page - and protect against recursive
    calls into bitmap_file_kick.

    bitmap_update_sb returns an error that is never checked.

    So make these 'void' and be consistent about checking the bit.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

17 Jun, 2007

1 commit

  • If raid1/repair (which reads all blocks and fixes any differences it finds)
    hits a read error, it doesn't reset the bio for writing before writing
    correct data back, so the read error isn't fixed, and the device probably
    gets a zero-length write which it might complain about.

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Accetta
     

11 May, 2007

1 commit

  • When a raid1 has only one working drive, we want read error to propagate up
    to the filesystem as there is no point failing the last drive in an array.

    Currently the code that performs this check is racy. If a write and a read are
    both submitted to a device on a 2-drive raid1, and the write fails followed
    by the read failing, the read will see that there is only one working drive
    and will pass the failure up, even though the one working drive is actually
    the *other* one.

    So, tighten up the locking.

    Signed-off-by: Neil Brown
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

10 May, 2007

2 commits

  • This reverts commit 5b479c91da90eef605f851508744bfe8269591a0.

    Quoth Neil Brown:

    "It causes an oops when auto-detecting raid arrays, and it doesn't
    seem easy to fix.

    The array may not be 'open' when do_md_run is called, so
    bdev->bd_disk might be NULL, so bd_set_size can oops.

    This whole approach of opening an md device before it has been
    assembled just seems to get more and more painful. I think I'm going
    to have to come up with something clever to provide both backward
    compatibility with usage expectation, and sane integration into the
    rest of the kernel."

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • md currently uses ->media_changed to make sure rescan_partitions
    is called on md arrays after they are assembled.

    However that doesn't happen until the array is opened, which is later
    than some people would like.

    So use blkdev_ioctl to do the rescan immediately that the
    array has been assembled.

    This means we can remove all the ->change infrastructure as it was only used
    to trigger a partition rescan.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

27 Jan, 2007

2 commits

  • If a GFP_KERNEL allocation is attempted in md while the mddev_lock is held,
    it is possible for a deadlock to eventuate.

    This happens if the array was marked 'clean', and the memalloc triggers a
    write-out to the md device.

    For the writeout to succeed, the array must be marked 'dirty', and that
    requires getting the mddev_lock.

    So, before attempting a GFP_KERNEL allocation while holding the lock, make
    sure the array is marked 'dirty' (unless it is currently read-only).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When 'repair' finds a block that is different on the various parts of the
    mirror, it is meant to write a chosen good version to the others. However it
    currently writes out the original data to each. The memcpy to make all the
    data the same is missing.
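
    The intended behaviour can be sketched in plain userspace C. This is only
    an illustrative model: repair_block is a made-up helper, and treating
    copies[0] as the chosen good version is an assumption of the sketch, not
    necessarily how raid1 picks it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative model of a 'repair' pass over one block held in several
 * mirror copies. copies[0] is assumed to be the chosen good version.
 * Returns the number of copies that had to be overwritten.
 */
static int repair_block(unsigned char **copies, int ncopies, size_t len)
{
    int fixed = 0;

    assert(copies != NULL && ncopies >= 1);
    for (int i = 1; i < ncopies; i++) {
        if (memcmp(copies[0], copies[i], len) != 0) {
            /* Make the data the same before it is written back. */
            memcpy(copies[i], copies[0], len);
            fixed++;
        }
    }
    return fixed;
}
```

    Without the memcpy step, each differing copy would simply be rewritten
    with its own original contents and the mismatch would persist.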

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

12 Jan, 2007

1 commit

  • md raidX make_request functions strip off the BIO_RW_SYNC flag, thus
    introducing additional latency.

    Fixing this in raid1 and raid10 seems to be straightforward enough.

    For our particular usage case in DRBD, passing this flag improved some
    initialization time from ~5 minutes to ~5 seconds.

    Acked-by: NeilBrown
    Signed-off-by: Lars Ellenberg
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lars Ellenberg
     

14 Dec, 2006

1 commit


11 Dec, 2006

1 commit

  • Fix a few bugs that meant that:
    - superblocks weren't always written at exactly the right time (this
    could show up if the array was not written to - writing to the array
    causes lots of superblock updates and so hides these errors).

    - restarting device recovery after a clean shutdown (version-1 metadata
    only) didn't work as intended (or at all).

    1/ Ensure superblock is updated when a new device is added.
    2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync.
    The body of this if takes one of two branches depending on whether
    MD_RECOVERY_SYNC is set, so testing it in the clause of the if
    is wrong.
    3/ Flag superblock for updating after a resync/recovery finishes.
    4/ If we find the need to restart a recovery in the middle (version-1
    metadata only) make sure a full recovery (not just as guided by
    bitmaps) does get done.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

29 Oct, 2006

1 commit


03 Oct, 2006

5 commits

  • raid1, raid10 and multipath don't report their 'congested' status through
    bdi_*_congested, but should.

    This patch adds the appropriate functions which just check the 'congested'
    status of all active members (with appropriate locking).

    raid1 read_balance should be modified to prefer devices where
    bdi_read_congested returns false. Then we could use the '&' branch rather
    than the '|' branch. However that would need some benchmarking first
    to make sure it is actually a good idea.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • The error handling routines don't use proper locking, and so two concurrent
    errors could trigger a problem.

    So:
    - use test-and-set and test-and-clear to synchronise
    the In_sync bits with the ->degraded count
    - use the spinlock to protect updates to the
    degraded count (could use an atomic_t but that
    would be a bigger change in code, and isn't
    really justified)
    - remove un-necessary locking in raid5
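
    The first two points can be modelled in userspace with C11 atomics. This is
    a sketch under stated assumptions, not the kernel code: IN_SYNC_BIT, struct
    member, struct array and the mark_* helpers are hypothetical names, and an
    atomic_flag stands in for the spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define IN_SYNC_BIT 0x1u        /* hypothetical bit for this sketch */

struct member {
    atomic_uint flags;          /* per-device state bits */
};

struct array {
    atomic_flag lock;           /* stand-in for the spinlock */
    int degraded;               /* protected by lock */
};

static void member_init(struct member *m)
{
    atomic_init(&m->flags, IN_SYNC_BIT);    /* devices start in sync */
}

static void array_init(struct array *a)
{
    atomic_flag_clear(&a->lock);
    a->degraded = 0;
}

static void array_lock(struct array *a)
{
    while (atomic_flag_test_and_set(&a->lock))
        ;                       /* spin until the holder releases */
}

static void array_unlock(struct array *a)
{
    atomic_flag_clear(&a->lock);
}

/* Error path: only the caller that actually clears the bit bumps the count. */
static void mark_failed(struct array *a, struct member *m)
{
    unsigned int old = atomic_fetch_and(&m->flags, ~IN_SYNC_BIT);

    assert(a != NULL);
    if (old & IN_SYNC_BIT) {
        array_lock(a);
        a->degraded++;
        array_unlock(a);
    }
}

/* Recovery path: only the caller that actually sets the bit drops the count. */
static void mark_recovered(struct array *a, struct member *m)
{
    unsigned int old = atomic_fetch_or(&m->flags, IN_SYNC_BIT);

    assert(a != NULL);
    if (!(old & IN_SYNC_BIT)) {
        array_lock(a);
        a->degraded--;
        array_unlock(a);
    }
}
```

    Because the bit transition is atomic, two concurrent error reports for the
    same device change the count once, not twice.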

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • It is equivalent to conf->raid_disks - conf->mddev->degraded.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • raid1d has too many nested blocks, so take the fix_read_error functionality
    out into a separate function.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Instead of magic numbers (0,1,2,3) in sb_dirty, we have
    some flags instead:
    MD_CHANGE_DEVS
    Some device state has changed requiring superblock update
    on all devices.
    MD_CHANGE_CLEAN
    The array has transitioned from 'clean' to 'dirty' or back,
    requiring a superblock update on active devices, but possibly
    not on spares
    MD_CHANGE_PENDING
    A superblock update is underway.

    We wait for an update to complete by waiting for all flags to be clear. A
    flag can be set at any time, even during an update, without risk that the
    change will be lost.
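
    A single-threaded sketch of why a flag set during an update is not lost.
    The flag names come from the description above, but their numeric values,
    struct sb_state, the 'inject' test hook and the loop structure are all
    assumptions made for illustration rather than the md implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names from the commit text; the bit values are assumed. */
#define MD_CHANGE_DEVS    0x1u
#define MD_CHANGE_CLEAN   0x2u
#define MD_CHANGE_PENDING 0x4u

struct sb_state {
    unsigned int flags;     /* outstanding change flags */
    int writes;             /* number of superblock writes performed */
    unsigned int inject;    /* test hook: flags raised during the next write */
};

/* Stand-in for the actual superblock write. */
static void write_superblocks(struct sb_state *s)
{
    s->writes++;
    s->flags |= s->inject;  /* simulate a change arriving mid-update */
    s->inject = 0;
}

/* Repeat the update until no change flag was raised while writing. */
static void update_sb(struct sb_state *s)
{
    assert(s != NULL);
    do {
        s->flags |= MD_CHANGE_PENDING;
        s->flags &= ~(MD_CHANGE_DEVS | MD_CHANGE_CLEAN);
        write_superblocks(s);
        s->flags &= ~MD_CHANGE_PENDING;
    } while (s->flags & (MD_CHANGE_DEVS | MD_CHANGE_CLEAN));
}

/* A waiter considers the update complete once every flag is clear. */
static bool update_complete(const struct sb_state *s)
{
    return s->flags == 0;
}
```

    A change flagged while a write is in flight survives the clearing of
    MD_CHANGE_PENDING and triggers another pass, so waiters only proceed once
    it has been written too.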

    Stop exporting md_update_sb - isn't needed.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

02 Sep, 2006

1 commit

  • We need to be careful when referencing mirrors[i].rdev. It can disappear
    under us at various times.

    So:
    fix a couple of problem places.
    comment a couple of non-problem places
    move an 'atomic_add' which dereferences rdev down a little
    way to somewhere where it is sure to not be NULL.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

28 Aug, 2006

1 commit


11 Jul, 2006

2 commits


27 Jun, 2006

3 commits

  • When an array has a bitmap, a device can be removed and re-added and only
    blocks changed since the removal (as recorded in the bitmap) will be resynced.

    It should be possible to do a similar thing to arrays without bitmaps. i.e.
    if a device is removed and re-added and *no* changes have been made in the
    interim, then the add should not require a resync.

    This patch allows that option. This means that when assembling an array one
    device at a time (e.g. during device discovery) the array can be enabled
    read-only as soon as enough devices are available, but extra devices can still
    be added without causing a resync.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • For a while we have had checkpointing of resync. The version-1 superblock
    allows recovery to be checkpointed as well, and this patch implements that.

    Due to early carelessness we need to add a feature flag to signal that the
    recovery_offset field is in use, otherwise older kernels would assume that a
    partially recovered array is in fact fully recovered.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • A recent change made this goto unnecessary, so reformat the code to make it
    clearer what is happening.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

02 May, 2006

1 commit

  • When retrying a failed BIO_RW_BARRIER request, we need to keep the reference
    in ->nr_pending over the whole retry. Currently, we only hold the reference
    if the failed request is the *last* one to finish - which is silly, because it
    would normally be the first to finish.

    So move the rdev_dec_pending call up into the didn't-fail branch. As the rdev
    isn't used in the later code, calling rdev_dec_pending earlier doesn't hurt.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown