06 Mar, 2010

3 commits

  • To prevent deadlock, bios in the hold list should be flushed before
    dm_rh_stop_recovery() is called in mirror_presuspend().

    The recovery can't start because there are pending bios and therefore
    dm_rh_stop_recovery() deadlocks.

    When there are pending bios in the hold list, the recovery waits for
    the bios to complete after it has acquired recovery_count.
    The recovery_count is released when the recovery finishes. However,
    the bios in the hold list are processed only after dm_rh_stop_recovery()
    returns in mirror_presuspend(). dm_rh_stop_recovery() also acquires
    recovery_count, so a deadlock occurs.

    Signed-off-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon
    Reviewed-by: Mikulas Patocka

    Takahiro Yasui
     
  • Remove unused parameters (start and len) of dm_get_device()
    and fix the callers.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Alasdair G Kergon

    Nikanth Karthikesan
     
  • If all mirror legs fail, always return an error instead of holding the
    bio, even if the handle_errors option was set. At present it is the
    responsibility of the driver underneath us to deal with retries,
    multipath etc.

    The patch adds the bio to the failures list instead of holding it
    directly. do_failures tests first if all legs failed and, if so,
    returns the bio with -EIO. If any leg is still alive and handle_errors
    is set, do_failures calls hold_bio.
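
The decision order described above can be sketched as a small userspace model. This is only an illustration, not the kernel code: the enum, the function name classify_failed_bio() and its parameters are hypothetical, and the final "complete with success" branch is assumed from the bio_endio(bio, 0) path quoted in the 17 Feb, 2010 entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome codes for one failed write bio (not kernel names). */
enum failure_action {
    FAIL_WITH_EIO,  /* complete the bio with -EIO */
    HOLD_BIO,       /* park the bio until userspace reconfigures the mirror */
    COMPLETE_OK     /* complete the bio with success */
};

/*
 * Model of the ordering in the commit message: the "all legs failed" test
 * comes first, so handle_errors can no longer cause such a bio to be held.
 */
static enum failure_action classify_failed_bio(int valid_legs, bool handle_errors)
{
    assert(valid_legs >= 0);

    if (valid_legs == 0)
        return FAIL_WITH_EIO;
    if (handle_errors)
        return HOLD_BIO;
    return COMPLETE_OK;
}
```

With no valid legs the result is FAIL_WITH_EIO whether or not handle_errors is set, which is the behavior change this commit describes.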

    Reviewed-by: Takahiro Yasui
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

17 Feb, 2010

1 commit

  • If the mirror log fails when the handle_errors option was not selected
    and there is no remaining valid mirror leg, writes return success even
    though they weren't actually written to any device. This patch
    completes them with EIO instead.

    This code path is taken:
      do_writes:
        bio_list_merge(&ms->failures, &sync);
      do_failures:
        if (!get_valid_mirror(ms))      (false)
        else if (errors_handled(ms))    (false)
        else bio_endio(bio, 0);

    The logic in do_failures is based on presuming that the write was already
    tried: if it succeeded at least on one leg (without handle_errors) it
    is reported as success.

    Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

11 Dec, 2009

11 commits

  • Explicitly initialize bio lists instead of relying on kzalloc.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Hold all write bios when leg fails and errors are handled

    When using a userspace daemon such as dmeventd to handle errors, we must
    delay completing bios until it has done its job.
    This patch prevents the following race:
    - primary leg fails
    - write "1" fails, the write is held, secondary leg is set as the default
    - write "2" goes straight to the secondary leg

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Hold all write bios when errors are handled.

    Previously the failures list was used only when handling errors with
    a userspace daemon such as dmeventd. Now, it is always used for all bios.
    The regions where some writes failed must be marked as nosync. This can only
    be done in process context (i.e. in raid1 workqueue), not in the
    write_callback function.

    Previously the write would succeed if writing to at least one leg
    succeeded. This is wrong because data from the failed leg may be
    replicated to the correct leg. Now, if using a userspace daemon, the
    write with some failures will be held until the daemon has done its job
    and reconfigured the array. If not using a daemon, the write still
    succeeds if at least one leg succeeds. This is bad, but it is consistent
    with current behavior.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move bio completion out of dm_rh_mark_nosync in preparation for the
    next patch.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move the logic to get a valid mirror leg into a function for re-use
    in a later patch.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use the hold framework in do_failures.

    This patch doesn't change the bio processing logic; it just simplifies
    failure handling and avoids periodically polling the failures list.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Add framework to delay bios until a suspend and then resubmit them with
    either DM_ENDIO_REQUEUE (if the suspend was noflush) or complete them
    with -EIO. I/O barrier support will use this.
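
A minimal userspace sketch of that release step, assuming a simple array of held bios. The struct held_bio type, the release_held_bios() function and the MODEL_ENDIO_REQUEUE value are stand-ins invented for this illustration; the real DM_ENDIO_REQUEUE constant and the hold list live in the device-mapper code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in value; the real DM_ENDIO_REQUEUE is defined by device-mapper. */
#define MODEL_ENDIO_REQUEUE 1

/* Hypothetical model of a bio parked on the hold list. */
struct held_bio {
    int result;      /* how the bio was finished */
    bool released;   /* true once the suspend has dealt with it */
};

/*
 * On suspend, every held bio is either handed back for requeueing
 * (noflush suspend) or completed with -EIO. Returns the number released.
 */
static size_t release_held_bios(struct held_bio *bios, size_t n, bool noflush)
{
    size_t i;

    assert(bios != NULL || n == 0);
    for (i = 0; i < n; i++) {
        bios[i].result = noflush ? MODEL_ENDIO_REQUEUE : -EIO;
        bios[i].released = true;
    }
    return n;
}
```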

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Report flush errors as 'F' instead of 'D' for log and mirror devices.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Implement flush callee. It uses dm_io to send a zero-size barrier synchronously
    and concurrently to all the mirror legs.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Introduce a callback pointer from the log to dm-raid1 layer.

    Before a region is set as "in-sync", we need to flush the hardware cache on
    all the disks. But the log module doesn't have access to the mirror_set
    structure, so it will use this callback.

    So far the callback is unused; it will be used in later patches.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Flush support for dm-raid1.

    When it receives an empty barrier, submit it to all the devices via dm-io.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

11 Sep, 2009

1 commit


05 Sep, 2009

1 commit

  • This patch fixes a bug which was triggering a case where the primary leg
    could not be changed on failure even when the mirror was in-sync.

    The case involves the failure of the primary device along with
    the transient failure of the log device. The problem is that
    bios can be put on the 'failures' list (due to log failure)
    before 'fail_mirror' is called due to the primary device failure.
    Normally, this is fine, but if the log device failure is transient,
    a subsequent iteration of the work thread, 'do_mirror', will
    reset 'log_failure'. The 'do_failures' function then resets
    the 'in_sync' variable when processing bios on the failures list.
    The 'in_sync' variable is what is used to determine if the
    primary device can be switched in the event of a failure. Since
    this has been reset, the primary device is incorrectly assumed
    to be not switchable.

    The case has been seen in the cluster mirror context, where one
    machine realizes the log device is dead before the other machines.
    As the responsibilities of the server migrate from one node to
    another (because the mirror is being reconfigured due to the failure),
    the new server may think for a moment that the log device is fine -
    thus resetting the 'log_failure' variable.

    In any case, it is inappropriate for us to reset the 'log_failure'
    variable. The above bug simply illustrates that it can actually
    hurt us.

    Cc: stable@kernel.org
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     

24 Jul, 2009

2 commits

  • Incorrect device area lengths are being passed to device_area_is_valid().

    The regression appeared in 2.6.31-rc1 through commit
    754c5fc7ebb417b23601a6222a6005cc2e7f2913.

    With the dm-stripe target, the size of the target (ti->len) was used
    instead of the stripe_width (ti->len/#stripes). An example of a
    consequent incorrect error message is:

    device-mapper: table: 254:0: sdb too small for target
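
The arithmetic behind the false "too small" message can be illustrated with a short sketch. Lengths are in sectors; stripe_width() mirrors the ti->len/#stripes expression from the text, while device_area_fits() is a hypothetical stand-in for the size check and is not the body of device_area_is_valid().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Length of the area each stripe device must provide, in sectors. */
static uint64_t stripe_width(uint64_t target_len, uint32_t stripes)
{
    assert(stripes > 0);
    return target_len / stripes;
}

/* Hypothetical check: does [start, start + len) fit on a device of dev_size? */
static bool device_area_fits(uint64_t dev_size, uint64_t start, uint64_t len)
{
    return start <= dev_size && len <= dev_size - start;
}
```

For a 2000-sector target striped over two 1500-sector devices, checking each device against the full target length (2000) fails, while checking against the stripe width (1000) succeeds.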

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • The recent commit 7513c2a761d69d2a93f17146b3563527d3618ba0 (dm raid1:
    add is_remote_recovering hook for clusters) changed do_writes() to
    update the ms->writes list but forgot to wake up kmirrord to process it.

    The rule is that whenever anything is added to ms->reads, ms->writes
    or ms->failures and the list was empty before, we must call
    wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
    processing). Otherwise the bios could sit on the list indefinitely.
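
The wake-up rule can be modelled in a few lines of userspace C. This is a sketch under assumptions: struct model_mirror, queue_write() and the wakeups counter are hypothetical, and the counter merely stands in for a call to wakeup_mirrord().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a mirror set with one work list. */
struct model_mirror {
    size_t writes_len;   /* number of bios on the writes list */
    unsigned wakeups;    /* how often the worker was woken */
};

/* Add one bio; wake the worker only if the list was empty beforehand. */
static void queue_write(struct model_mirror *ms)
{
    bool was_empty;

    assert(ms != NULL);
    was_empty = (ms->writes_len == 0);
    ms->writes_len++;
    if (was_empty)
        ms->wakeups++;   /* stands in for wakeup_mirrord() */
}
```

Skipping the wake-up on the empty-to-non-empty transition is what lets bios sit on the list indefinitely.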

    Signed-off-by: Mikulas Patocka
    CC: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

22 Jun, 2009

1 commit

  • Add .iterate_devices to 'struct target_type' to allow a function to be
    called for all devices in a DM target. Implemented it for all targets
    except those in dm-snap.c (origin and snapshot).

    (The raid1 version number jumps to 1.12 because we originally reserved
    1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
    instead.)

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: martin.petersen@oracle.com

    Mike Snitzer
     

15 Apr, 2009

1 commit


03 Apr, 2009

2 commits

  • The logging API needs an extra function to make cluster mirroring
    possible. This new function allows us to check whether a mirror
    region is being recovered on another machine in the cluster. This
    helps us prevent simultaneous recovery I/O and process I/O to the
    same locations on disk.

    Cluster-aware log modules will implement this function. Single
    machine log modules will not. So, there is no performance
    penalty for single machine mirrors.

    Signed-off-by: Jonathan Brassow
    Acked-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • With my previous patch to save bi_io_vec, the size of dm_raid1_read_record
    is significantly increased (the vector list takes 3072 bytes on 32-bit machines
    and 4096 bytes on 64-bit machines).

    The structure dm_raid1_read_record used to be allocated with kmalloc,
    but kmalloc rounds the size up to the next power of two, so an object
    slightly larger than 4096 bytes is allocated 8192 bytes of memory and
    almost half of that memory is wasted.

    This patch turns kmalloc into a slab cache which doesn't have this
    padding so it will reduce the memory consumed.
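
The waste from power-of-two rounding is easy to compute. The sketch below is a simplified model that assumes every allocation is rounded up to the next power of two, as the text describes; both function names are hypothetical and the code does not call the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= size (size must be non-zero and small
 * enough that the result fits in size_t). */
static size_t round_up_pow2(size_t size)
{
    size_t bucket = 1;

    assert(size > 0);
    while (bucket < size)
        bucket <<= 1;
    return bucket;
}

/* Bytes lost to rounding for one allocation of the given size. */
static size_t wasted_bytes(size_t size)
{
    return round_up_pow2(size) - size;
}
```

Under this model a 4100-byte object occupies 8192 bytes and wastes 4092 of them, whereas a dedicated slab cache sized for the object avoids that padding.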

    Cc: stable@kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

06 Jan, 2009

3 commits

  • Move log size validation from mirror target to log constructor.

    Remove the PAGE_SIZE restriction, which we no longer think necessary.

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Change dm_unregister_target to return void and use BUG() for error
    reporting.

    dm_unregister_target can only fail because of a programming bug in the
    target driver. It can't fail because of the user's behavior or disk errors.

    This patch changes dm_unregister_target to return void and use BUG() if
    someone tries to unregister a non-registered target or a target
    that is in use.

    This patch removes code duplication (testing of error codes in all dm
    targets) and reports bugs in just one place, in dm_unregister_target. In
    some target drivers, these return codes were ignored, which could lead
    to a situation where bugs could be missed.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Always increase the error count when I/O on a leg of a mirror fails.

    The error count is used to decide whether to select an alternative
    mirror leg. If the target doesn't use the "handle_errors" feature, the
    error count is not updated and the bio can get requeued forever by the
    read callback.

    Fix it by increasing error_count before the handle_errors feature
    checking.
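
The reordering can be sketched as follows. This is a userspace illustration only: struct model_leg and record_leg_failure() are hypothetical names, and the boolean return value (whether further error handling would run) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one mirror leg. */
struct model_leg {
    unsigned error_count;
};

/*
 * Count the failure first, then look at the handle_errors feature.
 * Returns true if further error handling would follow.
 */
static bool record_leg_failure(struct model_leg *leg, bool handle_errors)
{
    assert(leg != NULL);

    leg->error_count++;      /* always counted, before the feature check */
    if (!handle_errors)
        return false;        /* nothing more to do without the feature */
    return true;
}
```

Because the count is updated on every failure, the leg-selection logic can see that a leg is bad even when handle_errors is not in use.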

    Cc: stable@kernel.org
    Signed-off-by: Milan Broz
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     

14 Nov, 2008

1 commit


30 Oct, 2008

1 commit


22 Oct, 2008

3 commits


10 Oct, 2008

1 commit

  • dm-raid1 is setting the 'DM_KCOPYD_IGNORE_ERROR' flag unconditionally
    when assigning kcopyd work. kcopyd is responsible for copying an
    assigned section of disk to one or more other disks. The
    'DM_KCOPYD_IGNORE_ERROR' flag affects kcopyd in the following way:

    When not set:
    kcopyd will immediately stop the copy operation when an error is
    encountered.

    When set:
    kcopyd will try to proceed regardless of errors and try to continue
    copying any remaining amount.

    Since dm-raid1 tracks regions of the address space that are (or
    are not) in sync and it now has the ability to handle these
    errors, we can safely let kcopyd stop at the first error. This
    optimization is applied only when mirror error handling has been enabled.
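
One way to picture the conditional flag is the sketch below. It reads the commit as: keep ignoring errors when error handling is off, and let kcopyd stop at the first error when it is on; that reading is an assumption. MODEL_KCOPYD_IGNORE_ERROR is a stand-in bit mask, not the kernel's DM_KCOPYD_IGNORE_ERROR definition, and recovery_copy_flags() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in mask for illustration; not the kernel's flag definition. */
#define MODEL_KCOPYD_IGNORE_ERROR (1UL << 0)

/* Choose the copy flags for one recovery job. */
static unsigned long recovery_copy_flags(bool handle_errors)
{
    unsigned long flags = 0;

    /* Without error handling, keep copying past errors as before. */
    if (!handle_errors)
        flags |= MODEL_KCOPYD_IGNORE_ERROR;
    return flags;
}
```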

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     

25 Apr, 2008

8 commits

  • Remove an avoidable 3ms delay on some dm-raid1 and kcopyd I/O.

    It is specified that any submitted bio without the BIO_RW_SYNC flag may plug the
    queue (i.e. block the requests from being dispatched to the physical device).

    The queue is unplugged when the caller calls the blk_unplug() function. Usually, the
    sequence is that someone calls submit_bh to submit IO on a buffer. The IO plugs
    the queue and waits (to be possibly joined with other adjacent bios). Then, when
    the caller calls wait_on_buffer(), it unplugs the queue and submits the IOs to
    the disk.

    This was happening:

    When doing O_SYNC writes, function fsync_buffers_list() submits a list of
    bios to dm_raid1, the bios are added to dm_raid1 write queue and kmirrord is
    woken up.

    fsync_buffers_list() calls wait_on_buffer(). That unplugs the queue, but
    there are no bios on the device queue as they are still in the dm_raid1 queue.

    wait_on_buffer() starts waiting until the IO is finished.

    kmirrord is scheduled, kmirrord takes bios and submits them to the devices.

    The submitted bio plugs the harddisk queue but there is no one to unplug it.
    (The process that called wait_on_buffer() is already sleeping.)

    So there is a 3ms timeout, after which the queues on the harddisks are
    unplugged and requests are processed.

    This 3ms timeout meant that in certain workloads (e.g. O_SYNC, 8kb writes),
    dm-raid1 is 10 times slower than md raid1.

    Every time we submit something asynchronously via dm_io, we must unplug the
    queue to actually send the request to the device.

    This patch adds an unplug call to kmirrord - while processing requests, it keeps
    the queue plugged (so that adjacent bios can be merged); when it finishes
    processing all the bios, it unplugs the queue to submit the bios.

    It also fixes kcopyd which has the same potential problem. All kcopyd requests
    are submitted with BIO_RW_SYNC.
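
The batching behavior described above can be modelled with a toy queue. This is a sketch, not block-layer code: struct model_queue and the three functions are hypothetical, and "dispatched" simply counts requests that would have reached the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device queue that can be plugged. */
struct model_queue {
    bool plugged;
    unsigned pending;      /* submitted but not yet sent to the device */
    unsigned dispatched;   /* sent to the device */
};

/* Submitting a request plugs the queue and leaves the request pending. */
static void model_submit(struct model_queue *q)
{
    q->plugged = true;
    q->pending++;
}

/* Unplugging sends everything that is pending. */
static void model_unplug(struct model_queue *q)
{
    q->dispatched += q->pending;
    q->pending = 0;
    q->plugged = false;
}

/* Worker loop: submit the whole batch while plugged, then unplug once. */
static void process_batch(struct model_queue *q, unsigned nr_bios)
{
    unsigned i;

    assert(q != NULL);
    for (i = 0; i < nr_bios; i++)
        model_submit(q);
    model_unplug(q);   /* without this, requests wait for the timeout */
}
```

Unplugging once at the end keeps adjacent bios mergeable during the batch while avoiding the wait for the unplug timeout.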

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon
    Acked-by: Jens Axboe

    Mikulas Patocka
     
  • This patch replaces the schedule() in the main kmirrord thread with a timer.
    The schedule() could introduce an unwanted delay when work is ready to be
    processed.

    The code instead calls wake() when there's work to be done immediately, and
    delayed_wake() after a failure to give a short delay before retrying.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Publish the dm-io, dm-log and dm-kcopyd headers in include/linux.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Clean up the dm-log interface to prepare for publishing it in include/linux.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • Clean up the kcopyd interface to prepare for publishing it in include/linux.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • Clean up the dm-io interface to prepare for publishing it in include/linux.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • Move the dirty region log code into a separate module so
    other targets can share the code.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Heinz Mauelshagen
     
  • Use shorter list_splice_init() for brevity.

    Signed-off-by: Robert P. J. Day
    Signed-off-by: Alasdair G Kergon

    Robert P. J. Day