21 Aug, 2012

1 commit

  • flush[_delayed]_work_sync() are now spurious. Mark them deprecated
    and convert all users to flush[_delayed]_work().

    If you're cc'd and wondering what's going on: Now all workqueues are
    non-reentrant and the regular flushes guarantee that the work item is
    not pending or running on any CPU on return, so there's no reason to
    use the sync flushes at all and they're going away.

    This patch doesn't make any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Ian Campbell
    Cc: Jens Axboe
    Cc: Mattia Dongili
    Cc: Kent Yoder
    Cc: David Airlie
    Cc: Jiri Kosina
    Cc: Karsten Keil
    Cc: Bryan Wu
    Cc: Benjamin Herrenschmidt
    Cc: Alasdair Kergon
    Cc: Mauro Carvalho Chehab
    Cc: Florian Tobias Schandinat
    Cc: David Woodhouse
    Cc: "David S. Miller"
    Cc: linux-wireless@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: Sangbeom Kim
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: Eric Van Hensbergen
    Cc: Takashi Iwai
    Cc: Steven Whitehouse
    Cc: Petr Vandrovec
    Cc: Mark Fasheh
    Cc: Christoph Hellwig
    Cc: Avi Kivity

    Tejun Heo
     

27 Jul, 2012

3 commits

  • Commit outstanding metadata before returning the status for a dm thin
    pool so that the numbers reported are as up-to-date as possible.

    The commit is not performed if the device is suspended or if
    the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target
    through a new 'status_flags' parameter in the target's dm_status_fn.

    The userspace dmsetup tool will support the --noflush flag with the
    'dmsetup status' and 'dmsetup wait' commands from version 1.02.76
    onwards.

    Tested-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Use boolean bit fields for flags in struct dm_target.

    Signed-off-by: Alasdair G Kergon

    Alasdair G Kergon
     
  • Remove the restriction that limits a target's specified maximum incoming
    I/O size to be a power of 2.

    Rename this setting from 'split_io' to the less-ambiguous 'max_io_len'.
    Change it from sector_t to uint32_t, which is plenty big enough, and
    introduce a wrapper function dm_set_target_max_io_len() to set it.
    Use sector_div() to process it now that it is not necessarily a power of 2.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

20 Jul, 2012

2 commits

  • We can't guarantee that REQ_DISCARD on dm-mirror zeroes the data even if
    the underlying disks support zero on discard. So this patch sets
    ti->discard_zeroes_data_unsupported.

    For example, if the mirror is in the process of resynchronizing, it may
    happen that kcopyd reads a piece of data, then discard is sent on the
    same area and then kcopyd writes the piece of data to another leg.
    Consequently, the data is not zeroed.

    The flag was made available by commit 983c7db347db8ce2d8453fd1d89b7a4bb6920d56
    (dm crypt: always disable discard_zeroes_data).

    Signed-off-by: Mikulas Patocka
    Cc: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • This patch fixes a crash when a discard request is sent during mirror
    recovery.

    Firstly, some background. Generally, the following sequence happens during
    mirror synchronization:
    - function do_recovery is called
    - do_recovery calls dm_rh_recovery_prepare
    - dm_rh_recovery_prepare uses a semaphore to limit the number of
    simultaneously recovered regions (by default the semaphore value is 1,
    so only one region at a time is recovered)
    - dm_rh_recovery_prepare calls __rh_recovery_prepare, which asks the
    log driver for the next region to recover. Then, it sets the region
    state to DM_RH_RECOVERING. If there are no pending I/Os on this
    region, the region is added to the quiesced_regions list. If there
    are pending I/Os, the region is not added to any list; it is added
    to the quiesced_regions list later (by the dm_rh_dec function) when
    all I/Os finish.
    - when the region is on quiesced_regions list, there are no I/Os in
    flight on this region. The region is popped from the list in
    dm_rh_recovery_start function. Then, a kcopyd job is started in the
    recover function.
    - when the kcopyd job finishes, recovery_complete is called. It calls
    dm_rh_recovery_end. dm_rh_recovery_end adds the region to
    recovered_regions or failed_recovered_regions list (depending on
    whether the copy operation was successful or not).

    The above mechanism assumes that if the region is in DM_RH_RECOVERING
    state, no new I/Os are started on this region. When I/O is started,
    dm_rh_inc_pending is called, which increases reg->pending count. When
    I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
    If the count is zero and the region was in DM_RH_RECOVERING state,
    dm_rh_dec adds it to the quiesced_regions list.

    Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
    in DM_RH_RECOVERING state, it could be added to quiesced_regions list
    multiple times or it could be added to this list when kcopyd is copying
    data (it is assumed that the region is not on any list while kcopyd does
    its jobs). This results in memory corruption and crash.

    There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
    do not belong to any region, so they are always added to the sync list
    in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
    requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
    requests. These bypasses avoid the crash possibility described above.

    These bypasses were improperly implemented for REQ_DISCARD when
    the mirror target gained discard support in commit
    5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard).

    In do_writes, REQ_DISCARD requests are always added to the sync queue and
    immediately dispatched (even if the region is in DM_RH_RECOVERING). However,
    dm_rh_inc and dm_rh_dec are called for REQ_DISCARD requests. So this
    violates the rule that no I/Os are started on DM_RH_RECOVERING regions,
    and causes the list corruption described above.

    This patch changes it so that REQ_DISCARD requests follow the same path
    as REQ_FLUSH. This avoids the crash.

    Reference: https://bugzilla.redhat.com/837607

    Signed-off-by: Mikulas Patocka
    Cc: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

29 Mar, 2012

1 commit

  • Device mapper uses sscanf to convert arguments to numbers. The problem is that
    the way we use it ignores additional unmatched characters in the scanned string.

    For example, `if (sscanf(string, "%d", &number) == 1)' will match a number,
    but it will also match a number with some garbage appended, like "123abc".

    As a result, device mapper accepts garbage after some numbers. For example
    the command `dmsetup create vg1-new --table "0 16384 linear 254:1bla 34816bla"'
    will pass without an error.

    This patch fixes all sscanf uses in device mapper. It appends "%c" with
    a pointer to a dummy character variable to every sscanf statement.

    The construct `if (sscanf(string, "%d%c", &number, &dummy) == 1)' succeeds
    only if string is a null-terminated number (optionally preceded by some
    whitespace characters). If there is some character appended after the number,
    sscanf matches "%c", writes the character to the dummy variable and returns 2.
    We check the return value for 1 and consequently reject numbers with some
    garbage appended.

    Signed-off-by: Mikulas Patocka
    Acked-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

29 May, 2011

3 commits

  • Return client directly from dm_kcopyd_client_create, not through a
    parameter, making it consistent with dm_io_client_create.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Reserve just the minimum of pages needed to process one job.

    Because we allocate pages from the page allocator, we don't need to
    reserve a large number of pages. The maximum job size is SUB_JOB_SIZE
    and we calculate the number of reserved pages based on this.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Replace the arbitrary calculation of an initial io struct mempool size
    with a constant.

    The code calculated the number of reserved structures based on the request
    size and used a "magic" multiplication constant of 4. This patch changes
    it to reserve a fixed number - itself still chosen quite arbitrarily.
    Further testing might show if there is a better number to choose.

    Note that if there is no memory pressure, we can still allocate an
    arbitrary number of "struct io" structures. One structure is enough to
    process the whole request.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Jan, 2011

4 commits

  • kmirrord_wq, kcopyd_work and md->wq are created per dm instance and
    serve only a single work item from the dm instance, so non-reentrant
    workqueues would provide the same ordering guarantees as ordered ones
    while allowing CPU affinity and use of the workqueues for other
    purposes. Switch them to non-reentrant workqueues.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • Convert all create[_singlethread]_work() users to the new
    alloc[_ordered]_workqueue(). This conversion is mechanical and
    doesn't introduce any behavior change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • flush_scheduled_work() is being deprecated. Flush the used work
    directly instead. In all dm targets, the only work which uses
    system_wq is ->trigger_event.

    Signed-off-by: Tejun Heo
    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Tejun Heo
     
  • Enable discard support in the DM mirror target.
    Also change an existing use of 'bvec' to 'addr' in the union.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     

10 Sep, 2010

1 commit

  • This patch converts bio-based dm to support REQ_FLUSH/FUA instead of
    now deprecated REQ_HARDBARRIER.

    * -EOPNOTSUPP handling logic dropped.

    * Preflush is handled as before but postflush is dropped and replaced
    with passing down REQ_FUA to member request_queues. This replaces
    one array wide cache flush w/ member specific FUA writes.

    * __split_and_process_bio() now calls __clone_and_map_flush() directly
    for flushes and guarantees that all FLUSH bio's going to targets are
    zero length.

    * It's now guaranteed that all FLUSH bio's which are passed onto dm
    targets are zero length. bio_empty_barrier() tests are replaced
    with REQ_FLUSH tests.

    * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes.

    * Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely
    enough to be marked with unlikely().

    * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue
    doesn't support cache flushing. Advertise REQ_FLUSH | REQ_FUA
    capability.

    * Request based dm isn't converted yet. dm_init_request_based_queue()
    resets flush support to 0 for now. To avoid disturbing request
    based dm code, dm->flush_error is added for bio based dm while
    request based dm continues to use dm->barrier_error.

    Lightly tested linear, stripe, raid1, snap and crypt targets. Please
    proceed with caution as I'm not familiar with the code base.

    Signed-off-by: Tejun Heo
    Cc: dm-devel@redhat.com
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tejun Heo
     

08 Aug, 2010

1 commit

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This makes it easier to trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Mar, 2010

3 commits

  • To prevent deadlock, bios in the hold list should be flushed before
    dm_rh_stop_recovery() is called in mirror_presuspend().

    The recovery cannot start while those bios are pending, so
    dm_rh_stop_recovery() deadlocks.

    When there are pending bios in the hold list, the recovery waits for
    the completion of those bios after recovery_count is acquired.
    recovery_count is released only when the recovery finishes; however,
    the bios in the hold list are processed after dm_rh_stop_recovery() in
    mirror_presuspend(). dm_rh_stop_recovery() also acquires
    recovery_count, so a deadlock occurs.

    Signed-off-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon
    Reviewed-by: Mikulas Patocka

    Takahiro Yasui
     
  • Remove the unused parameters (start and len) of dm_get_device()
    and fix the callers.

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Alasdair G Kergon

    Nikanth Karthikesan
     
  • If all mirror legs fail, always return an error instead of holding the
    bio, even if the handle_errors option was set. At present it is the
    responsibility of the driver underneath us to deal with retries,
    multipath etc.

    The patch adds the bio to the failures list instead of holding it
    directly. do_failures tests first if all legs failed and, if so,
    returns the bio with -EIO. If any leg is still alive and handle_errors
    is set, do_failures calls hold_bio.

    Reviewed-by: Takahiro Yasui
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

17 Feb, 2010

1 commit

  • If the mirror log fails when the handle_errors option was not selected
    and there is no remaining valid mirror leg, writes return success even
    though they weren't actually written to any device. This patch
    completes them with EIO instead.

    This code path is taken:

    do_writes:
        bio_list_merge(&ms->failures, &sync);
    do_failures:
        if (!get_valid_mirror(ms))      (false)
        else if (errors_handled(ms))    (false)
        else bio_endio(bio, 0);

    The logic in do_failures is based on presuming that the write was already
    tried: if it succeeded at least on one leg (without handle_errors) it
    is reported as success.

    Reference: https://bugzilla.redhat.com/show_bug.cgi?id=555197

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

11 Dec, 2009

11 commits

  • Explicitly initialize bio lists instead of relying on kzalloc.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Hold all write bios when leg fails and errors are handled

    When using a userspace daemon such as dmeventd to handle errors, we must
    delay completing bios until it has done its job.
    This patch prevents the following race:
    - the primary leg fails
    - write "1" fails; the write is held and the secondary leg is made
      the default
    - write "2" goes straight to the secondary leg

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Hold all write bios when errors are handled.

    Previously the failures list was used only when handling errors with
    a userspace daemon such as dmeventd. Now, it is always used for all bios.
    The regions where some writes failed must be marked as nosync. This can only
    be done in process context (i.e. in raid1 workqueue), not in the
    write_callback function.

    Previously the write would succeed if writing to at least one leg
    succeeded. This is wrong because data from the failed leg may be
    replicated to the correct leg. Now, if using a userspace daemon, the
    write with some failures will be held until the daemon has done its job
    and reconfigured the array. If not using a daemon, the write still
    succeeds if at least one leg succeeds. This is bad, but it is consistent
    with current behavior.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move bio completion out of dm_rh_mark_nosync in preparation for the
    next patch.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Move the logic to get a valid mirror leg into a function for re-use
    in a later patch.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Use the hold framework in do_failures.

    This patch doesn't change the bio processing logic, it just simplifies
    failure handling and avoids periodically polling the failures list.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Add framework to delay bios until a suspend and then resubmit them with
    either DM_ENDIO_REQUEUE (if the suspend was noflush) or complete them
    with -EIO. I/O barrier support will use this.

    Signed-off-by: Mikulas Patocka
    Reviewed-by: Takahiro Yasui
    Tested-by: Takahiro Yasui
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Report flush errors as 'F' instead of 'D' for log and mirror devices.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Implement the flush callee. It uses dm_io to send a zero-size barrier
    synchronously and concurrently to all the mirror legs.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Introduce a callback pointer from the log to dm-raid1 layer.

    Before a region is set as "in-sync", we need to flush the hardware
    cache on all the disks. But the log module doesn't have access to the
    mirror_set structure, so it will use this callback.

    So far the callback is unused; it will be used in later patches.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Flush support for dm-raid1.

    When it receives an empty barrier, submit it to all the devices via dm-io.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

05 Sep, 2009

1 commit

  • This patch fixes a bug which was triggering a case where the primary leg
    could not be changed on failure even when the mirror was in-sync.

    The case involves the failure of the primary device along with
    the transient failure of the log device. The problem is that
    bios can be put on the 'failures' list (due to log failure)
    before 'fail_mirror' is called due to the primary device failure.
    Normally, this is fine, but if the log device failure is transient,
    a subsequent iteration of the work thread, 'do_mirror', will
    reset 'log_failure'. The 'do_failures' function then resets
    the 'in_sync' variable when processing bios on the failures list.
    The 'in_sync' variable is what is used to determine if the
    primary device can be switched in the event of a failure. Since
    this has been reset, the primary device is incorrectly assumed
    to be not switchable.

    The case has been seen in the cluster mirror context, where one
    machine realizes the log device is dead before the other machines.
    As the responsibilities of the server migrate from one node to
    another (because the mirror is being reconfigured due to the failure),
    the new server may think for a moment that the log device is fine -
    thus resetting the 'log_failure' variable.

    In any case, it is inappropriate for us to reset the 'log_failure'
    variable. The above bug simply illustrates that it can actually
    hurt us.

    Cc: stable@kernel.org
    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     

24 Jul, 2009

2 commits

  • Incorrect device area lengths are being passed to device_area_is_valid().

    The regression appeared in 2.6.31-rc1 through commit
    754c5fc7ebb417b23601a6222a6005cc2e7f2913.

    With the dm-stripe target, the size of the target (ti->len) was used
    instead of the stripe_width (ti->len/#stripes). An example of a
    consequent incorrect error message is:

    device-mapper: table: 254:0: sdb too small for target

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • The recent commit 7513c2a761d69d2a93f17146b3563527d3618ba0 (dm raid1:
    add is_remote_recovering hook for clusters) changed do_writes() to
    update the ms->writes list but forgot to wake up kmirrord to process it.

    The rule is that when anything is being added on ms->reads, ms->writes
    or ms->failures and the list was empty before we must call
    wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
    processing). Otherwise the bios could sit on the list indefinitely.

    Signed-off-by: Mikulas Patocka
    CC: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

22 Jun, 2009

1 commit

  • Add .iterate_devices to 'struct target_type' to allow a function to be
    called for all devices in a DM target. Implemented it for all targets
    except those in dm-snap.c (origin and snapshot).

    (The raid1 version number jumps to 1.12 because we originally reserved
    1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
    instead.)

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: martin.petersen@oracle.com

    Mike Snitzer
     

03 Apr, 2009

1 commit

  • The logging API needs an extra function to make cluster mirroring
    possible. This new function allows us to check whether a mirror
    region is being recovered on another machine in the cluster. This
    helps us prevent simultaneous recovery I/O and process I/O to the
    same locations on disk.

    Cluster-aware log modules will implement this function. Single
    machine log modules will not. So, there is no performance
    penalty for single machine mirrors.

    Signed-off-by: Jonathan Brassow
    Acked-by: Heinz Mauelshagen
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow