07 Jun, 2008

3 commits

  • If a block is computed (rather than read) then a check/repair operation
    may be led to believe that the data on disk is correct, when in fact it
    isn't. So only compute blocks for failed devices.

    This issue has been around since at least 2.6.12, but has become harder to
    hit in recent kernels since most reads bypass the cache.

    echo repair > /sys/block/mdN/md/sync_action will set the parity blocks to the
    correct state.
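
    For illustration, the whole sequence on an affected array might look like
    this (the device name /dev/md0 is just an example):

    $ echo check > /sys/block/md0/md/sync_action
    $ cat /sys/block/md0/md/mismatch_cnt          # non-zero after the check finishes if parity is stale
    $ echo repair > /sys/block/md0/md/sync_action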

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • If an array was created with --assume-clean we will oops when trying to
    set ->resync_max.

    Fix this by initializing ->recovery_wait in mddev_find.
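
    One way to reach the affected path (commands and the 'sync_max' sysfs
    attribute, which writes ->resync_max, are illustrative assumptions, not
    taken from the patch):

    $ mdadm --create /dev/md0 --level=5 --raid-devices=3 --assume-clean /dev/sd[bcd]
    $ echo 1000000 > /sys/block/md0/md/sync_max   # could oops here on unpatched kernels (uninitialised ->recovery_wait)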

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • During the initial array synchronization process there is a window,
    between when a prexor operation is scheduled to a specific stripe and when
    it completes, in which a sync_request can be scheduled to the same stripe.
    When this happens the prexor completes and the stripe is unconditionally
    marked "in_sync", effectively canceling the sync_request for the stripe.
    Prior to 2.6.23 this was not a problem because the prexor operation was
    done under sh->lock. The effect in older kernels was that the prexor would
    still erroneously mark the stripe "in_sync", but sync_request would be
    held off and re-mark the stripe as "!in_sync".

    Change the write completion logic to not mark the stripe "in_sync" if a
    prexor was performed. The effect of the change is to sometimes not set
    STRIPE_INSYNC. The worst this can do is cause the resync to stall waiting
    for STRIPE_INSYNC to be set. If this were happening, then STRIPE_SYNCING
    would be set and handle_issuing_new_read_requests would cause all
    available blocks to eventually be read, at which point prexor would never
    be used on that stripe any more and STRIPE_INSYNC would eventually be set.

    echo repair > /sys/block/mdN/md/sync_action will correct arrays that may
    have lost this race.

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

25 May, 2008

8 commits

  • When we get any IO error during a recovery (rebuilding a spare), we abort
    the recovery and restart it.

    For RAID6 (and multi-drive RAID1) it may not be best to restart at the
    beginning: when multiple failures can be tolerated, the recovery may be
    able to continue and re-doing all that has already been done doesn't make
    sense.

    We already have the infrastructure to record where a recovery is up to
    and restart from there, but it is not being used properly.
    This is because:
    - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
    which causes the recovery not to be checkpointed.
    - We remove spares and then re-add them, which loses important state
    information.

    The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
    needed. If there is an error, the relevant drive will be marked as
    Faulty, and that is enough to ensure correct handling of the error. So we
    first remove MD_RECOVERY_ERR, changing some of the uses of it to
    MD_RECOVERY_INTR.

    Then we cause the attempt to remove a non-faulty device from an array to
    fail (unless recovery is impossible as the array is too degraded). Then
    when remove_and_add_spares attempts to remove the devices on which
    recovery can continue, it will fail, they will remain in place, and
    recovery will continue on them as desired.

    Issue: If we are halfway through rebuilding a spare and another drive
    fails, and a new spare is immediately available, do we want to:
    1/ complete the current rebuild, then go back and rebuild the new spare or
    2/ restart the rebuild from the start and rebuild both devices in
    parallel.

    Both options can be argued for. The code currently takes option 2 as
    a/ this requires least code change
    b/ this results in a minimally-degraded array in minimal time.

    Cc: "Eivind Sarto"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • In some configurations, a raid6 resync can be limited by CPU speed
    (Calculating P and Q and moving data) rather than by device speed. In
    these cases there is nothing to be gained by serialising resync of arrays
    that share a device, and doing the resync in parallel can provide benefit.
    So add a sysfs tunable to flag an array as being allowed to resync in
    parallel with other arrays that use (a different part of) the same device.
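
    In the final interface this is a per-array sysfs attribute; assuming the
    name 'sync_force_parallel' (not spelled out in this log), enabling it on
    two arrays that share disks looks like:

    $ echo 1 > /sys/block/md0/md/sync_force_parallel
    $ echo 1 > /sys/block/md1/md/sync_force_parallel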

    Signed-off-by: Bernd Schubert
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schubert
     
  • This additional notification to 'array_state' is needed to allow the
    monitor application to learn about stop events via sysfs. The
    sysfs_notify("sync_action") call that comes at the end of do_md_stop()
    (via md_new_event) is insufficient since the 'sync_action' attribute has
    been removed by this point.

    (Seems like a sysfs-notify-on-removal patch is a better fix. Currently
    removal updates the event count but does not wake up waiters)

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • When an array enters write pending, 'array_state' changes, so we must be
    sure to sysfs_notify.

    Also, when waiting for user-space to acknowledge 'write-pending' by
    marking the metadata as dirty, we don't want to wait for MD_CHANGE_DEVS to
    be cleared as that might not happen. So explicitly test for the bits that
    we are really interested in.
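
    Roughly, an external metadata handler sees the transition and acknowledges
    it through 'array_state' (paths illustrative):

    $ cat /sys/block/md0/md/array_state
    write-pending
    $ echo active > /sys/block/md0/md/array_state   # after marking the metadata dirty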

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • When performing a "recovery" or "check" pass on a RAID1 array, we read
    from each device and possibly, if there is a difference or a read error,
    write back to some devices.

    We use the same 'bio' for both read and write, resetting various fields
    between the two operations.

    We forgot to reset bv_offset and bv_len however. These are often left
    unchanged, but in the case where there is an IO error one or two sectors
    into a page, they are changed.

    This results in correctable errors not being corrected properly. It does
    not result in any data corruption.

    Cc: "Fairbanks, David"
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • Last night we had scsi problems and a hardware raid unit was offlined
    during heavy i/o. While this happened we got, for about 3 minutes, a huge
    number of messages like these:

    Apr 12 03:36:07 pfs1n14 kernel: [197510.696595] raid5:md7: read error not correctable (sector 2993096568 on sdj2).

    I guess the high error rate is responsible for not scheduling other events
    - during this time the system was not pingable and in the end other
    devices also ran into scsi command timeouts, causing problems on these
    unrelated devices as well.

    Signed-off-by: Bernd Schubert
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schubert
     
  • Kill the trivial and rather pointless file_path wrapper around d_path.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • It is possible to add a write-intent bitmap to an active array, or remove
    the bitmap that is there.

    When we do either of these we 'quiesce' the array, which causes
    make_request to block in "wait_barrier()".

    However we are sampling the value of "mddev->bitmap" before the
    wait_barrier call, and using it afterwards. This can result in using a
    bitmap structure that has been freed.
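
    The race is exercised by adding or removing a bitmap while the array is
    busy, for example (mdadm invocations illustrative):

    $ mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap to the running array
    $ mdadm --grow /dev/md0 --bitmap=none       # and remove it again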

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

15 May, 2008

1 commit

  • As setting and clearing queue flags now requires that we hold a spinlock
    on the queue, and as blk_queue_stack_limits is called without that lock,
    get the lock inside blk_queue_stack_limits.

    For blk_queue_stack_limits to be able to find the right lock, each md
    personality needs to set q->queue_lock to point to the appropriate lock.
    Those personalities which didn't previously use a spin_lock use
    q->__queue_lock. So always initialise that lock when the queue is allocated.

    With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
    longer cause warnings as it will be clear that the proper lock is held.

    Thanks to Dan Williams for review and fixing the silly bugs.

    Signed-off-by: NeilBrown
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Alistair John Strachan
    Cc: Nick Piggin
    Cc: "Rafael J. Wysocki"
    Cc: Jacek Luczak
    Cc: Prakash Punnoor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     

13 May, 2008

1 commit

  • commit bd2ab67030e9116f1e4aae1289220255412b37fd "md: close a livelock window
    in handle_parity_checks5" introduced a bug in handling 'repair' operations.
    After a repair operation completes we clear the state bits tracking this
    operation. However, they are cleared too early and this results in the code
    deciding to re-run the parity check operation. Since we have done the repair
    in memory the second check does not find a mismatch and thus does not do a
    writeback.

    Test results:
    $ echo repair > /sys/block/md0/md/sync_action
    $ cat /sys/block/md0/md/mismatch_cnt
    51072
    $ echo repair > /sys/block/md0/md/sync_action
    $ cat /sys/block/md0/md/mismatch_cnt
    0

    (also fix incorrect indentation)

    Cc:
    Tested-by: George Spelvin
    Acked-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

09 May, 2008

1 commit

  • drivers/md/raid10.c:889:17: warning: Using plain integer as NULL pointer
    drivers/media/video/cx18/cx18-driver.c:616:12: warning: Using plain integer as NULL pointer
    sound/oss/kahlua.c:70:12: warning: Using plain integer as NULL pointer

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Cc: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

30 Apr, 2008

9 commits

  • Allows a userspace metadata handler to take action upon detecting a device
    failure.

    Based on an original patch by Neil Brown.

    Changes:
    -added blocked_wait waitqueue to rdev
    -don't qualify Blocked with Faulty; always let userspace block writes
    -added md_wait_for_blocked_rdev to wait for the block device to be clear; if
    userspace misses the notification another one is sent every 5 seconds
    -set MD_RECOVERY_NEEDED after clearing "blocked"
    -kill DoBlock flag, just test mddev->external
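
    A sketch of the expected userspace side, using the per-device sysfs state
    file (attribute values assumed from the md sysfs interface, paths
    illustrative):

    $ cat /sys/block/md0/md/dev-sdb/state               # shows 'blocked' alongside 'faulty'
    $ echo -blocked > /sys/block/md0/md/dev-sdb/state   # clear it once the failure is recorded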

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Found when trying to reassemble an active externally managed array. Without
    this check we hit the more noisy "sysfs duplicate" warning in the later call
    to kobject_add.

    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • When setting an array to 'readonly' or to 'active' via sysfs, we must make the
    appropriate set_disk_ro call too.

    Also when switching to "read_auto" (which is like readonly, but blocks on the
    first write so that metadata can be marked 'dirty') we need to be more careful
    about what state we are changing from.
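
    The effect can be checked from userspace, e.g. (commands illustrative):

    $ echo readonly > /sys/block/md0/md/array_state
    $ blockdev --getro /dev/md0     # should now report 1
    $ echo active > /sys/block/md0/md/array_state
    $ blockdev --getro /dev/md0     # back to 0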

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • 'safemode' relates to marking an array as 'clean' if there has been no write
    traffic for a while (a couple of seconds), to reduce the chance of the array
    being found dirty on reboot.

    ->safemode is set to '1' when there have been no writes for a while, and it
    gets set to '0' when the superblock is updated with the 'clean' flag set.

    This requires a few fixes for 'external' metadata:
    - When an array is set to 'clean' via sysfs, 'safemode' must be cleared.
    - when we write to an array that has 'safemode' set (there must have been
    some delay in updating the metadata), we need to clear safemode.
    - Don't try to update external metadata in md_check_recovery for safemode
    transitions - it won't work.

    Also, don't try to support "immediate safe mode" (safemode==2) for external
    metadata, it cannot really work (the safemode timeout can be set very low if
    this is really needed).
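
    For external metadata a short timeout can stand in for immediate safe
    mode, e.g. (values illustrative):

    $ cat /sys/block/md0/md/safe_mode_delay     # default is roughly 0.2 seconds
    $ echo 0.05 > /sys/block/md0/md/safe_mode_delay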

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • I keep finding problems where an mddev gets reused and some fields have a
    value from a previous usage that confuses the new usage. So clear all fields
    that could possibly need clearing when calling do_md_stop.

    Also initialise the 'level' of a new array to LEVEL_NONE (which isn't 0).

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • All the metadata update processing for external metadata is done in
    user-space or through the sysfs interfaces, so make "md_update_sb" a no-op
    in that case.

    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     
  • rdev->mddev is no longer valid upon return from entry->store() when the
    'remove' command is given.
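
    The command in question is the per-device 'remove' write, e.g. (path
    illustrative):

    $ echo remove > /sys/block/md0/md/dev-sdb/state   # rdev->mddev must not be touched after this returns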

    Cc:
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • dm.c already provides mutual exclusion through ->map_lock.

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

29 Apr, 2008

4 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Skip I/O merges when disabled
    block: add large command support
    block: replace sizeof(rq->cmd) with BLK_MAX_CDB
    ide: use blk_rq_init() to initialize the request
    block: use blk_rq_init() to initialize the request
    block: rename and export rq_init()
    block: no need to initialize rq->cmd with blk_get_request
    block: no need to initialize rq->cmd in prepare_flush_fn hook
    block/blk-barrier.c:blk_ordered_cur_seq() mustn't be inline
    block/elevator.c:elv_rq_merge_ok() mustn't be inline
    block: make queue flags non-atomic
    block: add dma alignment and padding support to blk_rq_map_kern
    unexport blk_max_pfn
    ps3disk: Remove superfluous cast
    block: make rq_init() do a full memset()
    relay: fix splice problem

    Linus Torvalds
     
  • Use proc_create()/proc_create_data() to make sure that ->proc_fops and ->data
    are set up before gluing the PDE to the main tree.

    Signed-off-by: Denis V. Lunev
    Cc: Greg Kroah-Hartman
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Peter Osterlund
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Dmitry Torokhov
    Cc: Neil Brown
    Cc: Mauro Carvalho Chehab
    Cc: Bjorn Helgaas
    Cc: Alessandro Zummo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis V. Lunev
     
  • blk_get_request initializes rq->cmd (rq_init does) so the users don't
    need to do that.

    The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd,
    as a preparation for large command support, which changes rq->cmd from
    the static array to a pointer. sizeof(rq->cmd) will not make sense and
    &rq->cmd won't work.

    Signed-off-by: FUJITA Tomonori
    Cc: James Bottomley
    Cc: Alasdair G Kergon
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • We can save some atomic ops in the IO path, if we clearly define
    the rules of how to modify the queue flags.

    Signed-off-by: Jens Axboe

    Nick Piggin
     

28 Apr, 2008

6 commits

  • The functions time_before, time_before_eq, time_after, and time_after_eq
    are more robust for comparing jiffies against other values.

    A simplified version of the semantic patch making this change is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @ change_compare_np @
    expression E;
    @@

    (
    - jiffies <= E
    + time_before_eq(jiffies,E)
    |
    - jiffies >= E
    + time_after_eq(jiffies,E)
    |
    - jiffies < E
    + time_before(jiffies,E)
    |
    - jiffies > E
    + time_after(jiffies,E)
    )

    @ include depends on change_compare_np @
    @@

    #include <linux/jiffies.h>

    @ no_include depends on !include && change_compare_np @
    @@

    #include <linux/...>
    + #include <linux/jiffies.h>
    //
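
    A semantic patch like this is applied with spatch; option names differ
    between Coccinelle releases, so the invocation below is only a sketch:

    $ spatch --sp-file jiffies.cocci --in-place drivers/md/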

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Julia Lawall
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • MD drivers use one printk() call to print 2 log messages and the second line
    may be prefixed by a TAB character. It may also output a trailing space
    before newline. klogd (I think) turns the TAB character into the 2 characters
    '^I' when logging to a file. This looks ugly.

    Instead of a leading TAB to indicate continuation, prefix both output lines
    with 'raid:' or similar. Also remove any trailing space in the vicinity of
    the affected code and consistently end the sentences with a period.

    Signed-off-by: Nick Andrew
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Andrew
     
  • strict_strtoul handles the open-coded sanity checks in
    raid5_store_stripe_cache_size and raid5_store_preread_threshold

    Acked-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Improve write performance by preventing the delayed_list from dumping all its
    stripes onto the handle_list in one shot. Delayed stripes are now further
    delayed by being held on the 'hold_list'. The 'hold_list' is bypassed when:

    * a STRIPE_IO_STARTED stripe is found at the head of 'handle_list'
    * 'handle_list' is empty and i/o is being done to satisfy full stripe-width
    write requests
    * 'bypass_count' is less than 'bypass_threshold'. By default the threshold
    is 1, i.e. every other stripe handled is a preread stripe provided the
    top two conditions are false.

    Benchmark data:
    System: 2x Xeon 5150, 4x SATA, mem=1GB
    Baseline: 2.6.24-rc7
    Configuration: mdadm --create /dev/md0 /dev/sd[b-e] -n 4 -l 5 --assume-clean
    Test1: dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
    * patched: +33% (stripe_cache_size = 256), +25% (stripe_cache_size = 512)

    Test2: tiobench --size 2048 --numruns 5 --block 4096 --block 131072 (XFS)
    * patched: +13%
    * patched + preread_bypass_threshold = 0: +37%
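
    The threshold is a per-array sysfs attribute, so the fairness/throughput
    trade-off can be adjusted at runtime (path illustrative):

    $ cat /sys/block/md0/md/preread_bypass_threshold        # defaults to 1
    $ echo 0 > /sys/block/md0/md/preread_bypass_threshold   # favour throughput, as in the benchmark above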

    Changes since v1:
    * reduce bypass_threshold from (chunk_size / sectors_per_chunk) to (1) and
    make it configurable. This defaults to fairness and modest performance
    gains out of the box.
    Changes since v2:
    * [neilb@suse.de]: kill STRIPE_PRIO_HI and preread_needed as they are not
    necessary, the important change was clearing STRIPE_DELAYED in
    add_stripe_bio and this has been moved out to make_request for the hang
    fix.
    * [neilb@suse.de]: simplify get_priority_stripe
    * [dan.j.williams@intel.com]: reset the bypass_count when ->hold_list is
    sampled empty (+11%)
    * [dan.j.williams@intel.com]: decrement the bypass_count at the detection
    of stripes being naturally promoted off of hold_list +2%. Note, resetting
    bypass_count instead of decrementing on these events yields +4% but that is
    probably too aggressive.
    Changes since v3:
    * cosmetic fixups

    Tested-by: James W. Laferriere
    Signed-off-by: Dan Williams
    Signed-off-by: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • __FUNCTION__ is gcc-specific, use __func__

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • drivers/md/md.c:734:16: warning: Using plain integer as NULL pointer
    drivers/md/md.c:1115:16: warning: Using plain integer as NULL pointer

    Add some braces to match the else-block as well.

    Signed-off-by: Harvey Harrison
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

25 Apr, 2008

7 commits

  • The small patch below:
    - Removes the unused md argument from both specific_minor() and next_free_minor()
    - Folds kmalloc + memset(0) into a single kzalloc call in alloc_dev()

    This has been compile tested on x86.

    Signed-off-by: Frederik Deweerdt
    Signed-off-by: Alasdair G Kergon

    Frederik Deweerdt
     
  • dm_create_error_table() was added in kernel 2.6.18 and never used...

    Signed-off-by: Adrian Bunk
    Signed-off-by: Alasdair G Kergon

    Adrian Bunk
     
  • void returning functions returned the return value of another void
    returning function...

    Spotted by sparse.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Alasdair G Kergon

    Adrian Bunk
     
  • Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.

    It is specified that any submitted bio without BIO_RW_SYNC flag may plug the
    queue (i.e. block the requests from being dispatched to the physical device).

    The queue is unplugged when the caller calls the blk_unplug() function. Usually, the
    sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
    the queue and waits (to be possibly joined with other adjacent bios). Then, when
    the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
    the disk.

    This was happening:

    When doing O_SYNC writes, function fsync_buffers_list() submits a list of
    bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
    woken up.

    fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
    there are no bios on the device queue as they are still in the dm_raid1 queue.

    wait_on_buffer() starts waiting until the IO is finished.

    kmirrord is scheduled, kmirrord takes bios and submits them to the devices.

    The submitted bio plugs the harddisk queue but there is no one to unplug it.
    (The process that called wait_on_buffer() is already sleeping.)

    So there is a 3ms timeout, after which the queues on the harddisks are
    unplugged and requests are processed.

    This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
    dm-raid1 is 10 times slower than md raid1.

    Every time we submit something asynchronously via dm_io, we must unplug the
    queue to actually send the request to the device.

    This patch adds an unplug call to kmirrord - while processing requests, it keeps
    the queue plugged (so that adjacent bios can be merged); when it finishes
    processing all the bios, it unplugs the queue to submit the bios.

    It also fixes kcopyd which has the same potential problem. All kcopyd requests
    are submitted with BIO_RW_SYNC.
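
    The regression shows up with small synchronous writes; a workload of that
    shape can be generated with, for example (target path illustrative):

    $ dd if=/dev/zero of=/mnt/mirror/testfile bs=8k count=1000 oflag=sync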

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Acked-by: Jens Axboe

    Mikulas Patocka
     
  • This patch replaces the schedule() in the main kmirrord thread with a timer.
    The schedule() could introduce an unwanted delay when work is ready to be
    processed.

    The code instead calls wake() when there's work to be done immediately, and
    delayed_wake() after a failure to give a short delay before retrying.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Publish the dm-io, dm-log and dm-kcopyd headers in include/linux.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Rename kcopyd.[ch] to dm-kcopyd.[ch].

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon