16 Jan, 2012

1 commit

  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     

15 Jan, 2012

1 commit


12 Jan, 2012

1 commit

  • Two bugfixes for md.

    One is a recently introduced regression that affects an unusual
    configuration with a guaranteed BUG_ON. Has been tagged for -stable.
    The other is minor missing functionality.

    * tag 'md-3.3-fixes' of git://neil.brown.name/md:
    md/raid1: perform bad-block tests for WriteMostly devices too.
    md: notify the 'degraded' sysfs attribute on failure.

    Linus Torvalds
     

11 Jan, 2012

3 commits

  • Stacking driver queue limits are typically bounded exclusively by the
    capabilities of the low level devices, not by the stacking driver
    itself.

    This patch introduces blk_set_stacking_limits() which has more liberal
    metrics than the default queue limits function. This allows us to
    inherit topology parameters from bottom devices without manually
    tweaking the default limits in each driver prior to calling the stacking
    function.

    Since there is now a clear distinction between stacking and low-level
    devices, blk_set_default_limits() has been modified to carry the more
    conservative values that we used to manually set in
    blk_queue_make_request().

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
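
The relationship between the two initialisers described above can be sketched in plain C. This is a minimal illustration, not the kernel code: `struct sk_limits`, the `sk_*` functions and the numeric defaults are hypothetical stand-ins, and only two limits are modelled. The idea shown is that a stacking driver starts from the most liberal values, so that folding in each bottom device leaves only the bottom devices' bounds.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a queue-limits structure. */
struct sk_limits {
    unsigned int max_sectors;     /* largest request, in sectors */
    unsigned short max_segments;  /* scatter/gather entries per request */
};

/* Conservative defaults for a low-level device (values are illustrative). */
void sk_set_default_limits(struct sk_limits *lim)
{
    assert(lim != NULL);
    lim->max_sectors = 255;
    lim->max_segments = 128;
}

/* Liberal starting point for a stacking driver: no bound of its own. */
void sk_set_stacking_limits(struct sk_limits *lim)
{
    assert(lim != NULL);
    lim->max_sectors = UINT_MAX;
    lim->max_segments = USHRT_MAX;
}

/* Fold one bottom device into the top device: keep the stricter value. */
void sk_stack_limits(struct sk_limits *top, const struct sk_limits *bottom)
{
    assert(top != NULL && bottom != NULL);
    if (bottom->max_sectors < top->max_sectors)
        top->max_sectors = bottom->max_sectors;
    if (bottom->max_segments < top->max_segments)
        top->max_segments = bottom->max_segments;
}
```

With this shape, a stacking driver would call `sk_set_stacking_limits()` once and then `sk_stack_limits()` for each bottom device, with no manual tweaking in between.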
     
  • We normally try to avoid reading from write-mostly devices, but when
    we do we really have to check for bad blocks and be sure not to
    try reading them.

    With the current code, best_good_sectors might not get set and that
    causes zero-length read requests to be sent down which is very
    confusing.

    This bug was introduced in commit d2eb35acfdccbe2 and so the patch
    is suitable for 3.1.x and 3.2.x

    Reported-and-tested-by: Michał Mirosław
    Reported-and-tested-by: Art -kwaak- van Breemen
    Signed-off-by: NeilBrown
    Cc: stable@vger.kernel.org

    NeilBrown
     
  • We currently only 'notify' changes to the 'degraded' attribute
    when it decreases, not when it increases.

    Notifying on failure is a little awkward as it happens in
    interrupt context.
    So instead, notify when we remove the failed device from the array,
    which is very soon afterwards.

    Reported-and-tested-by: Mikhail Balabin
    Signed-off-by: NeilBrown

    NeilBrown
     

09 Jan, 2012

1 commit

  • md update for 3.3

    Big change is new hot-replacement.
    A slot in an array can hold 2 devices - one that
    wants-replacement and one that is the replacement.
    Once the replacement is built - either from the
    original or (in the case of errors) from elsewhere,
    the wants-replacement device will be removed.

    * tag 'md-3.3' of git://neil.brown.name/md: (36 commits)
    md/raid1: Mark device want_replacement when we see a write error.
    md/raid1: If there is a spare and a want_replacement device, start replacement.
    md/raid1: recognise replacements when assembling arrays.
    md/raid1: handle activation of replacement device when recovery completes.
    md/raid1: Allow a failed replacement device to be removed.
    md/raid1: Allocate spare to store replacement devices and their bios.
    md/raid1: Replace use of mddev->raid_disks with conf->raid_disks.
    md/raid10: If there is a spare and a want_replacement device, start replacement.
    md/raid10: recognise replacements when assembling array.
    md/raid10: Allow replacement device to replace old drive.
    md/raid10: handle recovery of replacement devices.
    md/raid10: Handle replacement devices during resync.
    md/raid10: writes should get directed to replacement as well as original.
    md/raid10: allow removal of failed replacement devices.
    md/raid10: preferentially read from replacement device if possible.
    md/raid10: change read_balance to return an rdev
    md/raid10: prepare data structures for handling replacement.
    md/raid5: Mark device want_replacement when we see a write error.
    md/raid5: If there is a spare and a want_replacement device, start replacement.
    md/raid5: recognise replacements when assembling array.
    ...

    Linus Torvalds
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

23 Dec, 2011

32 commits

  • Now that WantReplacement drives are replaced cleanly, mark a drive
    as want_replacement when we see a write error. It might get failed soon so
    the WantReplacement flag is irrelevant, but if the write error is recorded
    in the bad block log, we still want to activate any spare that might
    be available.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When attempting to add a spare to a RAID1 array, also consider
    adding it as a replacement for a want_replacement device.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If a Replacement is seen, file it as such.

    If we see two replacements (or two normal devices) for the one slot,
    abort.

    Signed-off-by: NeilBrown

    NeilBrown
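
The assembly rule above (one normal device and at most one replacement per slot, otherwise abort) might look roughly like the following. This is a sketch under assumptions: `struct asm_dev`, `struct asm_slot` and `asm_file_device()` are hypothetical names, and the real code works on md's own device structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a device found while assembling an array. */
struct asm_dev {
    int raid_disk;        /* slot number the device claims */
    bool is_replacement;  /* true if marked as a replacement */
};

/* Each slot can hold one normal device and one replacement. */
struct asm_slot {
    const struct asm_dev *rdev;
    const struct asm_dev *replacement;
};

/* Returns 0 on success, -1 if assembly should abort. */
int asm_file_device(struct asm_slot *slots, int nslots,
                    const struct asm_dev *dev)
{
    assert(slots != NULL && dev != NULL);
    if (dev->raid_disk < 0 || dev->raid_disk >= nslots)
        return -1;

    struct asm_slot *s = &slots[dev->raid_disk];
    const struct asm_dev **where =
        dev->is_replacement ? &s->replacement : &s->rdev;

    if (*where != NULL)
        return -1;  /* two replacements, or two normal devices */
    *where = dev;
    return 0;
}
```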
     
  • When recovery completes ->spare_active is called.
    This checks if the replacement is ready and if so it fails
    the original.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Replacement devices are stored at a different offset, so look
    there too.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • In RAID1, a replacement is much like a normal device, so we just
    double the size of the relevant arrays and look at all possible
    devices for reads and writes.

    This means that the array looks like it is now double the size in some
    way - we need to be careful about that.
    In particular, when checking if the array is still degraded while
    creating a recovery request we need to only consider the first 'half'
    - i.e. the real (non-replacement) devices.

    Signed-off-by: NeilBrown

    NeilBrown
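
To illustrate the "first half only" caveat above, here is a minimal sketch. It assumes the doubled array keeps the real devices in entries `0 .. raid_disks-1` and the replacements in the second half; `struct r1_mirror` and `r1_count_degraded()` are hypothetical names, not the driver's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entry state in the doubled mirrors array. */
struct r1_mirror {
    bool present;  /* a device occupies this entry */
    bool in_sync;  /* the device is fully up to date */
};

/*
 * mirrors[] is assumed to hold 2 * raid_disks entries. Only the first
 * raid_disks entries (the real, non-replacement devices) decide whether
 * the array is degraded.
 */
int r1_count_degraded(const struct r1_mirror *mirrors, int raid_disks)
{
    int degraded = 0;

    assert(mirrors != NULL && raid_disks > 0);
    for (int i = 0; i < raid_disks; i++)  /* first half only */
        if (!mirrors[i].present || !mirrors[i].in_sync)
            degraded++;
    return degraded;
}
```

Looping to `2 * raid_disks` here would count empty replacement entries as missing devices, which is the mistake the commit message warns about.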
     
  • In general mddev->raid_disks can change unexpectedly while
    conf->raid_disks will only change in a very controlled way. So change
    some uses of one to the other.

    The use of mddev->raid_disks will not cause actual problems but
    this way is more consistent and safer in the long term.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When attempting to add a spare to a RAID10 array, also consider
    adding it as a replacement for a want_replacement device.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If a Replacement is seen, file it as such.

    If we see two replacements (or two normal devices) for the one slot,
    abort.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When recovery finishes and spare_active is called, check for a
    replacement that might have just become fully synced and mark it
    as such, marking the original as failed.

    Then when the original is removed, move the replacement into
    its position.

    This means that 'replacement' can spontaneously become NULL in some
    situations. Make sure we check for those.
    It also means that 'rdev' and 'replacement' could appear to be
    identical - check for that too.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If there is a replacement device, then recover to it,
    reading from any drives - maybe the one being replaced, maybe not.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If we need to resync an array which has replacement devices,
    we always write any block checked to every replacement.

    If the resync was bitmap-based resync we will then complete the
    replacement normally.
    If it was a full resync, we mark the replacements as fully recovered
    when the resync finishes so no further recovery is needed.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When writing, we need to submit two writes, one to the original,
    and one to the replacement - if there is a replacement.

    If the write to the replacement results in a write error we just
    fail the device. We only try to record write errors to the
    original.

    This only handles writing new data. Writing for resync/recovery
    will come later.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Enhance raid10_remove_disk to be able to remove ->replacement
    as well as ->rdev

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When reading (for array reads, not for recovery etc) we read from the
    replacement device if it has recovered far enough.
    This requires storing the chosen rdev in the 'r10_bio' so we can make
    sure to drop the ref on the right device when the read finishes.

    Signed-off-by: NeilBrown

    NeilBrown
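
The read-side choice described above can be sketched as follows. This is an illustration under assumptions: `struct r10_dev`, `struct r10_slot` and `r10_pick_read_dev()` are hypothetical, the original device is assumed to be fully in sync, and reference counting is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state relevant to choosing a read target. */
struct r10_dev {
    bool faulty;
    unsigned long long recovery_offset;  /* sectors below this are valid */
};

struct r10_slot {
    struct r10_dev *rdev;         /* original device, may be NULL */
    struct r10_dev *replacement;  /* replacement device, may be NULL */
};

/*
 * Prefer the replacement when the whole read lies below its recovery
 * offset; otherwise fall back to the original. Returns NULL if the slot
 * has no usable device. The caller would remember the returned pointer
 * so that it later releases the same device it read from.
 */
struct r10_dev *r10_pick_read_dev(const struct r10_slot *slot,
                                  unsigned long long sector,
                                  unsigned long long nsectors)
{
    assert(slot != NULL);
    struct r10_dev *repl = slot->replacement;

    if (repl && !repl->faulty && sector + nsectors <= repl->recovery_offset)
        return repl;
    if (slot->rdev && !slot->rdev->faulty)
        return slot->rdev;
    return NULL;
}
```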
     
  • It makes more sense to return an rdev than just an index as
    read_balance() gets a reference to the rdev and so returning
    the pointer makes this more idiomatic.

    This will be needed in a future patch when we might return
    a 'replacement' rdev instead of the main rdev.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Allow each slot in the RAID10 to have 2 devices, the want_replacement
    and the replacement.

    Also allow an r10bio to have 2 bios, and for resync/recovery allocate the
    second bio if there are any replacement devices.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Now that WantReplacement drives are replaced cleanly, mark a drive
    as WantReplacement when we see a write error. It might get failed soon so
    the WantReplacement flag is irrelevant, but if the write error is recorded
    in the bad block log, we still want to activate any spare that might
    be available.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When attempting to add a spare to a RAID[456] array, also consider
    adding it as a replacement for a want_replacement device.

    This requires that common md code attempt hot_add even when the array
    is not formally degraded.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • If a Replacement is seen, file it as such.

    If we see two replacements (or two normal devices) for the one slot,
    abort.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When recovery completes - as reported by a call to ->spare_active,
    we clear In_sync on the original and set it on the replacement.

    Then when the original gets removed we move the replacement from
    'replacement' to 'rdev'.

    This could race with other code that is looking at these pointers,
    so we use memory barriers and careful ordering to ensure that
    a reader might see one device twice, but never no devices.
    Then the readers guard against using both devices, which could
    only happen when writing.

    Signed-off-by: NeilBrown

    NeilBrown
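
The ordering argument above ("one device twice, but never no devices") can be illustrated with C11 atomics standing in for the kernel's own barrier primitives. Everything here is a sketch under assumptions: the `hr_*` names are hypothetical, and the real code uses different mechanisms for pointer access. The point shown is that the writer publishes the new 'rdev' before clearing 'replacement', and the reader loads them in the opposite order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct hr_dev { int id; };

struct hr_slot {
    _Atomic(struct hr_dev *) rdev;
    _Atomic(struct hr_dev *) replacement;
};

struct hr_view { struct hr_dev *rdev, *replacement; };

/* Writer: move the replacement into 'rdev', then clear 'replacement'. */
void hr_promote_replacement(struct hr_slot *slot)
{
    assert(slot != NULL);
    struct hr_dev *repl =
        atomic_load_explicit(&slot->replacement, memory_order_relaxed);
    if (repl == NULL)
        return;
    atomic_store_explicit(&slot->rdev, repl, memory_order_relaxed);
    /* Release: anyone who sees replacement == NULL also sees the new rdev. */
    atomic_store_explicit(&slot->replacement, (struct hr_dev *)NULL,
                          memory_order_release);
}

/* Reader: load 'replacement' first, then 'rdev'. */
struct hr_view hr_read_slot(struct hr_slot *slot)
{
    struct hr_view v;

    assert(slot != NULL);
    v.replacement =
        atomic_load_explicit(&slot->replacement, memory_order_acquire);
    v.rdev = atomic_load_explicit(&slot->rdev, memory_order_acquire);
    /* Guard against using the same device twice. */
    if (v.replacement == v.rdev)
        v.replacement = NULL;
    return v;
}
```

If the reader loaded 'rdev' first, it could observe NULL there (after the original was removed) and then NULL in 'replacement' (after promotion), seeing no device at all.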
     
  • During recovery we want to write to the replacement but not
    the original. So we have two new flags
    - R5_NeedReplace if this stripe has a replacement that needs to
    be written at some stage
    - R5_WantReplace if NeedReplace, and the data is available, and
    a 'sync' has been requested on this stripe.

    We also distinguish between 'sync and replace' which needs to read
    all other devices, and 'replace' which only needs to read the
    devices being replaced.

    Note that during resync we always write to any replacement device.
    It might not need to be written to, but as we don't read to compare,
    we have to write to be sure.

    Signed-off-by: NeilBrown

    NeilBrown
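
The condition for the second flag can be written as a small predicate. This is only a sketch: the `SK_*` bit values and the "data available" flag are hypothetical stand-ins for the driver's per-device stripe flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device flag bits for one stripe. */
enum {
    SK_NEED_REPLACE = 1u << 0,  /* replacement still has to be written */
    SK_DATA_READY   = 1u << 1,  /* block contents are available */
};

/*
 * "Want replace" holds when a replacement write is needed, the data is
 * available, and a sync has been requested on this stripe.
 */
bool sk_want_replace(unsigned int flags, bool sync_requested)
{
    return (flags & SK_NEED_REPLACE) != 0 &&
           (flags & SK_DATA_READY) != 0 &&
           sync_requested;
}
```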
     
  • When writing, we need to submit two writes, one to the original, and
    one to the replacement - if there is a replacement.

    If the write to the replacement results in a write error, we just fail
    the device. We only try to record write errors to the original.

    When writing for recovery, we shouldn't write to the original. This
    will be addressed in a subsequent patch that generally addresses
    recovery.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Enhance raid5_remove_disk to be able to remove ->replacement
    as well as ->rdev.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • If a replacement device is present and has been recovered far enough,
    then use it for reading into the stripe cache.

    If we get an error we don't try to repair it, we just fail the device.
    A replacement device that gives errors does not sound sensible.

    This requires removing the setting of R5_ReadError when we get
    a read error during a read that bypasses the cache. It was probably
    a bad idea anyway as we don't know that every block in the read
    caused an error, and it could cause ReadError to be set for the
    replacement device, which is bad.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • We currently initialise some fields of a bio when preparing a
    stripe_head, and again just before submitting the request.

    Remove the duplication by only setting the fields that lower level
    devices don't touch in raid5_build_block, and only set the changeable
    fields in ops_run_io.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Remove some #defines that are no longer used, and replace some
    others with an enum.
    And remove an unused field.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • Just enhance data structures to record a second device per slot to be
    used as a 'replacement' device, replacing the original.
    We also have a second bio in each slot in each stripe_head. This will
    only be used when writing to the array - we need to write to both the
    original and the replacement at the same time, so will need two bios.

    For now, only try using the replacement drive for aligned-reads.
    In this case, we prefer the replacement if it has been recovered far
    enough, otherwise use the original.

    This includes a small enhancement. Previously we would only do
    aligned reads if the target device was fully recovered. Now we also
    do them if it has recovered far enough.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • hot-replace is a feature being added to md which will allow a
    device to be replaced without removing it from the array first.

    With hot-replace a spare can be activated and recovery can start while
    the original device is still in place, thus allowing a transition from
    an unreliable device to a reliable device without leaving the array
    degraded during the transition. It can also be used when the original
    device is still reliable but is not wanted for some reason.

    This will eventually be supported in RAID4/5/6 and RAID10.

    This patch adds a super-block flag to distinguish the replacement
    device. If an old kernel sees this flag it will reject the device.

    It also adds two per-device flags which are viewable and settable via
    sysfs.
    "want_replacement" can be set to request that a device be replaced.
    "replacement" is set to show that this device is replacing another
    device.

    The "rd%d" links in /sys/block/mdXx/md only apply to the original
    device, not the replacement. We currently don't make links for the
    replacement - there doesn't seem to be a need.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Soon an array will be able to have multiple devices with the
    same raid_disk number (an original and a replacement). So removing
    a device based on the number won't work. So pass the actual device
    handle instead.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • When setting the slot number on a device in an active array we
    currently check that the number is not already in use.
    We then call into the personality's hot_add_disk function
    which performs the same test and returns the same error.

    Thus the common test is not needed.

    As we will shortly be changing some personalities to allow duplicates
    in some cases (to support hot-replace), the common test will become
    inconvenient.

    So remove the common test.

    Reviewed-by: Dan Williams
    Signed-off-by: NeilBrown

    NeilBrown
     
  • For each active region corresponding to a bit in the bitmap we have
    a 14-bit counter (and some flags).
    This counts
    number of active writes + bit in the on-disk bitmap + delay-needed.

    The "delay-needed" is because we always want a delay before clearing a
    bit. So the number here is normally number of active writes plus 2.
    If there have been no writes for a while, we drop to 1.
    If still no writes we clear the bit and drop to 0.

    So for consistency, when setting a bit from the on-disk bitmap or by
    request from user-space it is best to set the counter to '2' to start
    with.

    In particular we might also set the NEEDED_MASK flag at this time, and
    in all other cases NEEDED_MASK is only set when the counter is 2 or
    more.

    Signed-off-by: NeilBrown

    NeilBrown
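
The counter life cycle described above can be traced with a small model. This is a sketch under assumptions: the `bm_*` names and the flag bit position are hypothetical, overflow of the 14-bit counter is ignored, and the real code also updates the on-disk bitmap at the points noted in the comments.

```c
#include <assert.h>
#include <stdint.h>

#define BM_COUNTER_MASK 0x3fffu  /* 14-bit counter */
#define BM_NEEDED_MASK  0x8000u  /* assumed position of the "needed" flag */

static uint16_t bm_with_counter(uint16_t c, unsigned int n)
{
    return (uint16_t)((c & ~BM_COUNTER_MASK) | (n & BM_COUNTER_MASK));
}

/* Set a bit from the on-disk bitmap or on user request: start at 2. */
uint16_t bm_set_bit(uint16_t c, int needed)
{
    if ((c & BM_COUNTER_MASK) < 2)
        c = bm_with_counter(c, 2);
    if (needed)
        c = (uint16_t)(c | BM_NEEDED_MASK);
    return c;
}

/* A write starts: ensure the base of 2, then count the active write. */
uint16_t bm_start_write(uint16_t c)
{
    unsigned int n = c & BM_COUNTER_MASK;

    if (n < 2)
        n = 2;  /* the on-disk bit would be set here */
    return bm_with_counter(c, n + 1);
}

/* A write ends: one fewer active write. */
uint16_t bm_end_write(uint16_t c)
{
    unsigned int n = c & BM_COUNTER_MASK;

    assert(n > 2);
    return bm_with_counter(c, n - 1);
}

/* Periodic pass with no writes: 2 -> 1, then 1 -> 0 (bit cleared). */
uint16_t bm_daemon_tick(uint16_t c)
{
    unsigned int n = c & BM_COUNTER_MASK;

    if (n == 1 || n == 2)
        n--;
    return bm_with_counter(c, n);
}
```

Starting the counter at 2 in `bm_set_bit()` keeps it consistent with the write path, where the "needed" flag is only ever seen together with a counter of 2 or more.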