23 Dec, 2011

1 commit

  • While reshaping a degraded array (as when reshaping a RAID0 by first
    converting it to a degraded RAID4) we currently get confused about
    which devices are in_sync. In most cases we get it right, but in the
    region that is being reshaped we need to treat non-failed devices as
    in-sync when we have the data but haven't actually written it out yet.

    Reported-by: Adam Kwolek
    Signed-off-by: NeilBrown

    NeilBrown
     

09 Dec, 2011

1 commit


08 Dec, 2011

1 commit

  • Once a device is failed we really want to completely ignore it.
    It should go away soon anyway.

    In particular the presence of bad blocks on it should not cause us to
    block as we won't be trying to write there anyway.

    So as soon as we can check if a device is Faulty, do so and pretend
    that it is already gone if it is Faulty.

    Signed-off-by: NeilBrown

    NeilBrown
     

08 Nov, 2011

2 commits

  • All updates that occur under STRIPE_ACTIVE should be globally visible
    when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but
    clear_bit() does not.

    This is suitable for 3.1-stable.

    Signed-off-by: Dan Williams
    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    Dan Williams
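
The ordering requirement described in this commit can be illustrated in user space with C11 atomics. This is only a sketch under stated assumptions: `struct stripe_sketch` and both helper names are hypothetical, and acquire/release ordering stands in for the kernel's bit operations and explicit barrier (the kernel's test_and_set_bit() actually implies a stronger, full barrier).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a stripe; not the kernel's struct stripe_head. */
struct stripe_sketch {
    atomic_flag active;   /* plays the role of the STRIPE_ACTIVE bit */
    int payload;          /* data updated while the stripe is active */
};

/* Returns true if the caller now owns the stripe. The acquire ordering
 * models (weakly) the barrier implied by test_and_set_bit(). */
static bool stripe_try_activate(struct stripe_sketch *sh)
{
    assert(sh != NULL);
    return !atomic_flag_test_and_set_explicit(&sh->active,
                                              memory_order_acquire);
}

/* The release ordering makes every earlier store visible before the flag
 * is seen as clear. A plain clear would give no such guarantee. */
static void stripe_deactivate(struct stripe_sketch *sh)
{
    assert(sh != NULL);
    atomic_flag_clear_explicit(&sh->active, memory_order_release);
}
```

A thread that later wins stripe_try_activate() is then guaranteed to observe the payload written by the previous owner.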
     
  • When the number of failed devices exceeds the allowed number
    we must abort any active parity operations (checks or updates) as they
    are no longer meaningful, and can lead to a BUG_ON in
    handle_parity_checks6.

    This bug was introduced by commit 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8
    in 2.6.29.

    Reported-by: Manish Katiyar
    Tested-by: Manish Katiyar
    Acked-by: Dan Williams
    Signed-off-by: NeilBrown
    Cc: stable@kernel.org

    NeilBrown
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

05 Nov, 2011

1 commit

  • * 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits)
    block: don't call blk_drain_queue() if elevator is not up
    blk-throttle: use queue_is_locked() instead of lockdep_is_held()
    blk-throttle: Take blkcg->lock while traversing blkcg->policy_list
    blk-throttle: Free up policy node associated with deleted rule
    block: warn if tag is greater than real_max_depth.
    block: make gendisk hold a reference to its queue
    blk-flush: move the queue kick into
    blk-flush: fix invalid BUG_ON in blk_insert_flush
    block: Remove the control of complete cpu from bio.
    block: fix a typo in the blk-cgroup.h file
    block: initialize the bounce pool if high memory may be added later
    block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown
    block: drop @tsk from attempt_plug_merge() and explain sync rules
    block: make get_request[_wait]() fail if queue is dead
    block: reorganize throtl_get_tg() and blk_throtl_bio()
    block: reorganize queue draining
    block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg()
    block: pass around REQ_* flags instead of broken down booleans during request alloc/free
    block: move blk_throtl prototypes to block/blk.h
    block: fix genhd refcounting in blkio_policy_parse_and_set()
    ...

    Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion
    and making the request functions be of type "void" instead of "int" in
    - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c}
    - drivers/staging/zram/zram_drv.c

    Linus Torvalds
     

01 Nov, 2011

1 commit


26 Oct, 2011

2 commits

  • In 3.0 we changed the way recovery_disabled was handled so that instead
    of testing against zero, we test an mddev-> value against a conf->
    value.
    Two problems:
    1/ one place in raid1 was missed and still sets it to '1'.
    2/ We didn't explicitly set the conf-> value at array creation
    time.
    It defaulted to '0' just like the mddev value does so they
    could appear equal and thus disable recovery.
    This did not affect normal 'md' as it calls bind_rdev_to_array
    which changes the mddev value. However the dmraid interface
    doesn't call this and so doesn't change ->recovery_disabled; so at
    array start all recovery is incorrectly disabled.

    So initialise the 'conf' value to one less than the mddev value, so
    they will only be the same when explicitly set that way.

    Reported-by: Jonathan Brassow
    Signed-off-by: NeilBrown

    NeilBrown
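
The comparison scheme in this commit can be sketched as follows. The two structs and all function names are hypothetical, pared-down stand-ins, not the kernel's definitions; the point is only that two values which both default to zero compare equal until one of them is deliberately offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the array-wide and per-personality state. */
struct mddev_sketch { int recovery_disabled; };
struct conf_sketch  { int recovery_disabled; };

/* Recovery counts as disabled only while the two values match. */
static bool recovery_is_disabled(const struct mddev_sketch *m,
                                 const struct conf_sketch *c)
{
    assert(m != NULL && c != NULL);
    return m->recovery_disabled == c->recovery_disabled;
}

/* At array creation: start one below, so the values cannot match by
 * accident when both would otherwise default to zero. */
static void conf_init(struct conf_sketch *c, const struct mddev_sketch *m)
{
    c->recovery_disabled = m->recovery_disabled - 1;
}

/* Deliberately disable recovery by copying the current mddev value. */
static void disable_recovery(struct conf_sketch *c,
                             const struct mddev_sketch *m)
{
    c->recovery_disabled = m->recovery_disabled;
}
```

With both structs zero-initialised and no call to conf_init(), recovery_is_disabled() already returns true, which is the failure described for the dmraid interface.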
     
  • This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9
    which was in 2.6.36.

    There is a small window of time between when a device fails and when
    it is removed from the array. During this time we might still read
    from it, but we won't write to it - so it is possible that we could
    read stale data.

    We didn't need the test of 'Faulty' before because the test on
    In_sync was sufficient. Since we started allowing reads from the early
    part of non-In_sync devices we need a test on Faulty too.

    This is suitable for any kernel from 2.6.36 onwards, though the patch
    might need a bit of tweaking in 3.0 and earlier.

    Cc: stable@kernel.org
    Signed-off-by: NeilBrown

    NeilBrown
     

19 Oct, 2011

1 commit


11 Oct, 2011

5 commits


07 Oct, 2011

3 commits


21 Sep, 2011

1 commit

  • Two related problems:

    1/ some error paths call "md_unregister_thread(mddev->thread)"
    without subsequently clearing ->thread. A subsequent call
    to mddev_unlock will try to wake the thread, and crash.

    2/ Most calls to md_wakeup_thread are protected against the thread
    disappearing either by:
    - holding the ->mutex
    - having an active request, so something else must be keeping
    the array active.
    However mddev_unlock calls md_wakeup_thread after dropping the
    mutex and without any certainty of an active request, so the
    ->thread could theoretically disappear.
    So we need a spinlock to provide some protections.

    So change md_unregister_thread to take a pointer to the thread
    pointer, and ensure that it always does the required locking, and
    clears the pointer properly.

    Reported-by: "Moshe Melnikov"
    Signed-off-by: NeilBrown
    cc: stable@kernel.org

    NeilBrown
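
The pointer-to-pointer idea can be sketched in user space as below. Everything here is an assumption made for illustration: `struct worker`, the toy spinlock built on a C11 atomic_flag, and both function names are hypothetical and only mimic the shape of the change, not the kernel's locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct worker { int wakeups; };   /* hypothetical stand-in for the md thread */

static atomic_flag thread_lock = ATOMIC_FLAG_INIT;   /* toy spinlock */

static void lock(void)
{
    while (atomic_flag_test_and_set_explicit(&thread_lock,
                                             memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void unlock(void)
{
    atomic_flag_clear_explicit(&thread_lock, memory_order_release);
}

/* Wake the worker only if it still exists. The check and the use both
 * happen under the lock. Returns 1 if a worker was woken. */
static int wakeup_worker(struct worker **slot)
{
    int woke = 0;

    assert(slot != NULL);
    lock();
    if (*slot) {
        (*slot)->wakeups++;
        woke = 1;
    }
    unlock();
    return woke;
}

/* Takes the address of the pointer so the slot is always cleared, under
 * the lock, before the worker is released. */
static void unregister_worker(struct worker **slot)
{
    struct worker *w;

    assert(slot != NULL);
    lock();
    w = *slot;
    *slot = NULL;
    unlock();
    free(w);   /* free(NULL) is a no-op, so a repeated call is harmless */
}
```

Because the callee clears the slot itself, an error path cannot forget to do so, and a later wake-up attempt sees NULL instead of a dangling pointer.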
     

12 Sep, 2011

1 commit

  • There is very little benefit in letting a ->make_request instance
    update the bio's device and sector and loop around it in
    __generic_make_request when we can achieve the same by calling
    generic_make_request from the driver and letting the loop in
    generic_make_request handle it.

    Note that various drivers got the return value from ->make_request and
    returned non-zero values for errors.

    Signed-off-by: Christoph Hellwig
    Acked-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

31 Aug, 2011

1 commit

  • Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
    cannot work with internal metadata as it is the raid5d thread which
    will clear the blocked flag.
    This wasn't a problem in 3.0 and earlier as we only set the blocked
    flag when external metadata was used then.
    However we now set it always, so we need to be more careful.

    Signed-off-by: NeilBrown

    NeilBrown
     

28 Jul, 2011

7 commits

  • On a successful write to a known bad block, flag the stripe (sh)
    so that raid5d can remove the known bad block from the list.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • If a device has seen write errors, don't write to any known
    bad blocks on that device.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • When a write error is detected, don't mark the device as failed
    immediately but rather record the fact for handle_stripe to deal with.

    Handle_stripe then attempts to record a bad block. Only if that fails
    does the device get marked as faulty.

    Signed-off-by: NeilBrown

    NeilBrown
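
The deferred handling described in this commit can be sketched as below. The struct, the fixed-size table and both function names are hypothetical; the kernel's bad-block list and failure path are more involved. The sketch only shows the order of decisions: try to record the bad block first, and fail the device only if that is not possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BAD 2   /* tiny capacity so the failure path is easy to reach */

/* Hypothetical stand-in for a member device. */
struct dev_sketch {
    unsigned long long bad[MAX_BAD];   /* recorded bad sectors */
    size_t nbad;
    bool faulty;
};

/* Try to remember a bad sector; returns false when the table is full. */
static bool record_bad_block(struct dev_sketch *d, unsigned long long sector)
{
    assert(d != NULL);
    if (d->nbad == MAX_BAD)
        return false;
    d->bad[d->nbad++] = sector;
    return true;
}

/* Deferred write-error handling: fail the device only as a last resort. */
static void handle_write_error(struct dev_sketch *d, unsigned long long sector)
{
    if (!record_bad_block(d, sector))
        d->faulty = true;
}
```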
     
  • If we get an uncorrectable read error - record a bad block rather than
    failing the device.
    And if these errors (which may be due to known bad blocks) cause
    recovery to be impossible, record a bad block on the recovering
    devices, or abort the recovery.

    As we might abort a recovery without failing a device we need to teach
    RAID5 about recovery_disabled handling.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • There are two times that we might read in raid5:
    1/ when a read request fits within a chunk on a single
    working device.
    In this case, if there is any bad block in the range of
    the read, we simply fail the cache-bypass read and
    perform the read through the stripe cache.

    2/ when reading into the stripe cache. In this case we
    mark as failed any device which has a bad block in that
    strip (1 page wide).
    Note that we will both avoid reading and avoid writing.
    This is correct (as we will never read from the block, there
    is no point writing), but not optimal (as writing could 'fix'
    the error) - that will be addressed later.

    If we have not seen any write errors on the device yet, we treat a bad
    block like a recent read error. This will encourage an attempt to fix
    the read error which will either generate a write error, or will
    ensure good data is stored there. We don't yet forget the bad block
    in that case. That comes later.

    Now that we honour bad blocks when reading we can allow devices with
    bad blocks into the array.

    Signed-off-by: NeilBrown

    NeilBrown
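
The first case above (a cache-bypass read that must be abandoned if any bad block falls in its range) comes down to an interval-overlap test. A minimal sketch, assuming a hypothetical flat table of bad ranges rather than the kernel's actual bad-block representation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bad-block record: `len` sectors starting at `start`. */
struct bad_range {
    unsigned long long start;
    unsigned long long len;
};

/* Returns true if the read [start, start + len) touches any bad range.
 * A true result means the bypass read should be failed and retried
 * through the stripe cache. */
static bool read_hits_bad_block(const struct bad_range *tbl, size_t n,
                                unsigned long long start,
                                unsigned long long len)
{
    size_t i;

    assert(tbl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        /* Two half-open intervals overlap when each starts before
         * the other ends. */
        if (tbl[i].start < start + len &&
            start < tbl[i].start + tbl[i].len)
            return true;
    }
    return false;
}
```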
     
  • It is only safe to choose not to write to a bad block if that bad
    block is safely recorded in metadata - i.e. if it has been
    'acknowledged'.

    If it hasn't we need to wait for the acknowledgement.

    We support that using rdev->blocked wait and
    md_wait_for_blocked_rdev by introducing a new device flag
    'BlockedBadBlocks'.

    This flag is only advisory.
    It is cleared whenever we acknowledge a bad block, so that a waiter
    can re-check the particular bad blocks that it is interested in.

    It should be set by a caller when they find they need to wait.
    This (set after test) is inherently racy, but as
    md_wait_for_blocked_rdev already has a timeout, losing the race will
    have minimal impact.

    When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
    was set incorrectly (see above race).

    We also modify the way we manage 'Blocked' to fit better with the new
    handling of 'BlockedBadBlocks' and to make it consistent between
    externally managed and internally managed metadata. This requires
    that each raidXd loop checks if the metadata needs to be written and
    triggers a write (md_check_recovery) if needed. Otherwise a queued
    write request might cause raidXd to wait for the metadata to write,
    and only that thread can write it.

    Before writing metadata, we set FaultRecorded for all devices that
    are Faulty, then after writing the metadata we clear Blocked for any
    device for which the Fault was certainly Recorded.

    The 'faulty' device flag now appears in sysfs if the device is faulty
    *or* it has unacknowledged bad blocks. So user-space which does not
    understand bad blocks can continue to function correctly.
    User space which does, should not assume a device is faulty until it
    sees the 'faulty' flag, and then sees the list of unacknowledged bad
    blocks is empty.

    Signed-off-by: NeilBrown

    NeilBrown
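
The user-space rule in the last paragraph of this commit can be written as two predicates. The struct and function names are hypothetical and only model the described behaviour; they are not the sysfs interface itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one member device's state. */
struct rdev_sketch {
    bool faulty;              /* the device has really failed */
    int unacked_bad_blocks;   /* bad blocks not yet recorded in metadata */
};

/* What the reported state would say: 'faulty' covers both conditions,
 * so tools that know nothing about bad blocks keep working. */
static bool shows_faulty(const struct rdev_sketch *r)
{
    assert(r != NULL);
    return r->faulty || r->unacked_bad_blocks > 0;
}

/* A tool that understands bad blocks believes the flag only once the
 * list of unacknowledged bad blocks is empty. */
static bool aware_tool_treats_as_failed(const struct rdev_sketch *r)
{
    return shows_faulty(r) && r->unacked_bad_blocks == 0;
}
```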
     
  • As no personality understands bad block lists yet, we must
    reject any device that is known to contain bad blocks.
    As the personalities get taught, these tests can be removed.

    This only applies to raid1/raid5/raid10.
    For linear/raid0/multipath/faulty the whole concept of bad blocks
    doesn't mean anything so there is no point adding the checks.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     

27 Jul, 2011

11 commits

  • While preparing to write a stripe we keep the parity block or blocks
    locked (R5_LOCKED) - towards the end of schedule_reconstruction.

    If the array is discovered to have failed before this write completes
    we can leave those blocks LOCKED, and init_stripe will notice that a
    free stripe still has a locked block and will complain.

    So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
    'BUG' to a 'WARN_ON'.

    Signed-off-by: NeilBrown

    NeilBrown
     
  • Read errors are considered to be corrected if the write-back and
    re-read cycle finishes without further problems. Thus moving the
    rdev->corrected_errors counting after the re-read looks more
    reasonable IMHO.

    Signed-off-by: Namhyung Kim
    Signed-off-by: NeilBrown

    Namhyung Kim
     
  • There are places where sysfs links to rdev are handled
    in the same way. Add helper functions to consolidate
    them.

    Signed-off-by: Namhyung Kim
    Signed-off-by: NeilBrown

    Namhyung Kim
     
  • As per the printk_ratelimit comment, it should not be used.

    Signed-off-by: Christian Dietrich
    Signed-off-by: NeilBrown

    Christian Dietrich
     
  • handle_stripe5() and handle_stripe6() are now virtually identical.
    So discard one and rename the other to 'analyse_stripe()'.

    It always returns 0, so change it to 'void' and remove the 'done'
    variable in handle_stripe().

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • The RAID6 version of this code is usable for RAID5 providing:
    - we test "conf->max_degraded" rather than "2" as appropriate
    - we make sure s->failed_num[1] is meaningful (and not '-1')
    when s->failed > 1

    The 'return 1' must become 'goto finish' in the new location.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • Apart from 'prexor' which can only be set for RAID5, and
    'qd_idx' which can only be meaningful for RAID6, these two
    chunks of code are nearly the same.

    So combine them into one adding a test to call either
    handle_parity_checks5 or handle_parity_checks6 as appropriate.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
    also allowed 'read-modify-write'.
    Apart from this difference, handle_stripe_dirtying[56] are nearly
    identical. So resolve these differences and create just one function.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • Provided that ->failed_num[1] is not a valid device number (which is
    easily achieved) fetch_block6 provides all the functionality of
    fetch_block5.

    So remove the latter and rename the former to simply "fetch_block".

    Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
    can similarly be united.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
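
The proviso about ->failed_num[1] can be illustrated with a sentinel value. This is a sketch only: the struct is a hypothetical subset of the per-stripe state, and the helper names are invented. It shows why a two-slot test written for RAID6 also works when at most one device has failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the state gathered while analysing a stripe. */
struct stripe_state_sketch {
    int failed;          /* number of failed devices seen */
    int failed_num[2];   /* their indexes; -1 marks an unused slot */
};

static void state_init(struct stripe_state_sketch *s)
{
    assert(s != NULL);
    s->failed = 0;
    s->failed_num[0] = -1;
    s->failed_num[1] = -1;
}

/* Record a failed device, keeping at most two indexes. */
static void note_failed(struct stripe_state_sketch *s, int disk)
{
    if (s->failed < 2)
        s->failed_num[s->failed] = disk;
    s->failed++;
}

/* Disk indexes are never negative, so -1 cannot match a real disk and
 * the two-slot test is safe with zero, one or two failures. */
static bool is_failed_disk(const struct stripe_state_sketch *s, int disk)
{
    return disk == s->failed_num[0] || disk == s->failed_num[1];
}
```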
     
  • Next patch will unite fetch_block5 and fetch_block6.
    First I want to make the differences a little more clear.

    For RAID6 if we are writing at all and there is a failed device, then
    we need to load or compute every block so we can do a
    reconstruct-write.
    This case isn't needed for RAID5 - we will do a read-modify-write in
    that case.
    So make that test a separate test in fetch_block6 rather than merged
    with two other tests.

    Make a similar change in fetch_block5 so the one bit that is not
    needed for RAID6 is clearly separate.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
     
  • The difference between the RAID5 and RAID6 code here is easily
    resolved using conf->max_degraded.

    Signed-off-by: NeilBrown
    Reviewed-by: Namhyung Kim

    NeilBrown
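
Several of the unification commits above rely on testing conf->max_degraded instead of a hard-coded count. A minimal sketch of that idea, with a hypothetical pared-down configuration struct and invented helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical configuration; the real structure has many more fields. */
struct conf_sketch {
    int level;
    int max_degraded;
};

/* RAID6 tolerates two failed devices; RAID4 and RAID5 tolerate one. */
static void conf_setup(struct conf_sketch *c, int level)
{
    assert(c != NULL);
    c->level = level;
    c->max_degraded = (level == 6) ? 2 : 1;
}

/* One test serves every level, so shared code needs no "5" and "6"
 * variants for this check. */
static bool too_many_failed(const struct conf_sketch *c, int failed)
{
    return failed > c->max_degraded;
}
```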