24 Jul, 2009

3 commits

  • Incorrect device area lengths are being passed to device_area_is_valid().

    The regression appeared in 2.6.31-rc1 through commit
    754c5fc7ebb417b23601a6222a6005cc2e7f2913.

    With the dm-stripe target, the size of the whole target (ti->len)
    was used instead of the stripe_width (ti->len / #stripes). An
    example of the resulting incorrect error message is:

    device-mapper: table: 254:0: sdb too small for target
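
    For illustration, a minimal standalone sketch of the distinction
    (the values here are invented; only the division mirrors the fix):

        #include <stdio.h>

        typedef unsigned long long sector_t;

        int main(void)
        {
                sector_t ti_len = 2097152;      /* whole target: 1 GiB in 512-byte sectors */
                unsigned int stripes = 4;

                /* Each stripe device only has to hold its share of the
                 * target, so validation must use the stripe width. */
                sector_t stripe_width = ti_len / stripes;

                printf("validate %llu sectors per device, not %llu\n",
                       stripe_width, ti_len);
                return 0;
        }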

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • This patch removes DM's bio-based vs request-based conditional setting
    of next_ordered. For bio-based DM the next_ordered check is no longer a
    concern (as that check is now in the __make_request path). For
    request-based DM the default of QUEUE_ORDERED_NONE is now appropriate.

    bio-based DM was changed to work around the previously misplaced
    next_ordered check with this commit:
    99360b4c18f7675b50d283301d46d755affe75fd

    request-based DM does not yet support barriers but reacted to the above
    bio-based DM change with this commit:
    5d67aa2366ccb8257d103d0b43df855605c3c086

    The above changes are no longer needed given Neil Brown's recent fix to
    put the next_ordered check in the __make_request path:
    db64f680ba4b5c56c4be59f0698000df89ff0281

    Signed-off-by: Mike Snitzer
    Cc: Jun'ichi Nomura
    Cc: NeilBrown
    Acked-by: Kiyoshi Ueda
    Acked-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • The recent commit 7513c2a761d69d2a93f17146b3563527d3618ba0 (dm raid1:
    add is_remote_recovering hook for clusters) changed do_writes() to
    update the ms->writes list but forgot to wake up kmirrord to process it.

    The rule is that whenever anything is added to ms->reads, ms->writes
    or ms->failures and the list was empty beforehand, we must call
    wakeup_mirrord (for immediate processing) or delayed_wake (for delayed
    processing). Otherwise the bios could sit on the list indefinitely.

    Signed-off-by: Mikulas Patocka
    CC: stable@kernel.org
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     

09 Jul, 2009

1 commit

    Commit 5fd29d6ccbc98884569d6f3105aeca70858b3e0f ("printk: clean up
    handling of log-levels and newlines") changed printk semantics. printk
    lines with multiple KERN_<level> prefixes are no longer emitted as
    before the patch.

    <level> is now included in the output on each additional use.

    Remove all uses of multiple KERN_<level>s in formats.

    Signed-off-by: Joe Perches
    Signed-off-by: Linus Torvalds

    Joe Perches
     

02 Jul, 2009

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request()
    block: Restore barrier support for md and probably other virtual devices.
    block: get rid of queue-private command filter
    block: Create bip slabs with embedded integrity vectors
    cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue()
    cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue()
    Trivial typo fixes in Documentation/block/data-integrity.txt.

    Linus Torvalds
     
  • * 'for-linus' of git://neil.brown.name/md:
    md: use interruptible wait when duration is controlled by userspace.
    md/raid5: suspend shouldn't affect read requests.
    md: tidy up error paths in md_alloc
    md: fix error path when duplicate name is found on md device creation.
    md: avoid dereferencing NULL pointer when accessing suspend_* sysfs attributes.
    md: Use new topology calls to indicate alignment and I/O sizes

    Linus Torvalds
     

30 Jun, 2009

2 commits

    The offset passed to blk_stack_limits() must be in bytes, not sectors.
    Fixes false warnings like the following:
    device-mapper: table: 254:1: target device sda6 is misaligned
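
    For illustration, a minimal standalone sketch of the conversion
    (the partition start value is invented):

        #include <stdint.h>
        #include <stdio.h>

        #define SECTOR_SHIFT 9  /* 512-byte sectors */

        int main(void)
        {
                uint64_t start = 16065; /* hypothetical data start, in sectors */

                /* blk_stack_limits() expects the offset in bytes; passing
                 * the raw sector count makes correctly aligned devices
                 * look misaligned. */
                uint64_t offset = start << SECTOR_SHIFT;

                printf("offset: %llu bytes, not %llu\n",
                       (unsigned long long)offset,
                       (unsigned long long)start);
                return 0;
        }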

    Signed-off-by: Mike Snitzer
    Reported-by: Frans Pop
    Tested-by: Frans Pop
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Fix exception store name handling.

    We need to reference the exception store by a zero-terminated string.

    Fixes a regression introduced in commit f6bd4eb73cdf2a5bf954e497972842f39cabb7e3

    Cc: Yi Yang
    Cc: Jonathan Brassow
    Cc: stable@kernel.org
    Cc: Andrew Morton
    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     

22 Jun, 2009

24 commits

  • This patch converts dm-multipath target to request-based from bio-based.

    Basically, the patch just converts the I/O unit from struct bio
    to struct request.
    In the course of the conversion, it also changes the I/O queueing
    mechanism. The change in the I/O queueing is described in detail
    below.

    I/O queueing mechanism change
    -----------------------------
    In I/O submission, map_io(), there is no mechanism change from
    bio-based, since the clone request is ready for retry as it is.
    However, in I/O completion, do_end_io(), there is a mechanism change
    from bio-based, since the clone request is not ready for retry.

    In do_end_io() of bio-based, the clone bio has all needed memory
    for resubmission. So the target driver can queue it and resubmit
    it later without memory allocations.
    The mechanism has almost no overhead.

    On the other hand, in do_end_io() of request-based, the clone request
    doesn't have clone bios, so the target driver can't resubmit it
    as it is. To resubmit the clone request, memory allocation for
    clone bios is needed, which incurs some overhead.
    To avoid that overhead just for queueing, the target driver doesn't
    queue the clone request inside itself.
    Instead, the target driver asks dm core to queue and remap
    the original request of the clone request, since the only overhead
    for queueing is then freeing the memory of the clone request.

    As a result, the target driver doesn't need to record/restore
    the information of the original request for resubmitting
    the clone request. So dm_bio_details in dm_mpath_io is removed.

    multipath_busy()
    ---------------------
    The target driver returns "busy" only when both of the following
    hold at the time map() would be called:
    o the target driver would actually map the I/O, and
    o the mapped I/O would wait on the underlying device's queue due to
    congestion.

    In all other cases, the target driver doesn't return "busy".
    Otherwise, dm core would keep hold of the I/Os and the target driver
    couldn't do what it wants with them (e.g. when the target driver
    can't map I/Os now, it wants to fail them instead). A sketch of this
    rule follows.
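
    A minimal standalone sketch of the rule, assuming a simplified
    two-flag state (the real code inspects priority groups and the
    underlying request queues):

        #include <stdbool.h>
        #include <stdio.h>

        struct mpath_state {
                bool would_map;         /* map() would actually map the I/O */
                bool path_congested;    /* mapped I/O would wait on a busy queue */
        };

        /* Report "busy" only when the target would map the I/O and the
         * mapped I/O would merely wait on a congested underlying queue;
         * otherwise let dm core hand the I/O over. */
        static bool multipath_busy(const struct mpath_state *s)
        {
                return s->would_map && s->path_congested;
        }

        int main(void)
        {
                struct mpath_state s = { .would_map = true, .path_congested = true };

                printf("busy: %s\n", multipath_busy(&s) ? "yes" : "no");
                return 0;
        }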

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Hannes Reinecke
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
    This patch disables interrupts when taking map_lock to avoid
    lockdep warnings in request-based dm.

    request-based dm takes map_lock after taking queue_lock with
    interrupts disabled:
        spin_lock_irqsave(queue_lock)
        q->request_fn() == dm_request_fn()
           => dm_get_table()
              => read_lock(map_lock)
    while queue_lock could be (but currently isn't) taken in interrupt
    context.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Christof Schmitt
    Acked-by: Hannes Reinecke
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • Request-based dm doesn't have barrier support yet.
    So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm.
    Since the device type is decided at the first table loading time,
    setting the flag is deferred until then.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Hannes Reinecke
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch enables request-based dm.

    o Request-based dm and bio-based dm coexist, since there are
    some target drivers which are more fitting to bio-based dm.
    Also, there are other bio-based devices in the kernel
    (e.g. md, loop).
    Since bio-based device can't receive struct request,
    there are some limitations on device stacking between
    bio-based and request-based.

                         type of underlying device
                         bio-based     request-based
      ----------------------------------------------
      bio-based             OK              OK
      request-based         --              OK

    The device type is recognized by the queue flag in the kernel,
    so dm follows that.
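
    A minimal standalone sketch of the stacking rule in the table above
    (types and names are simplified stand-ins):

        #include <stdbool.h>
        #include <stdio.h>

        enum dm_queue_type { BIO_BASED, REQUEST_BASED };

        /* bio-based dm can sit on either kind of device; request-based
         * dm needs a request-based device underneath. */
        static bool can_stack(enum dm_queue_type dm, enum dm_queue_type under)
        {
                return dm == BIO_BASED || under == REQUEST_BASED;
        }

        int main(void)
        {
                printf("request-based on bio-based: %s\n",
                       can_stack(REQUEST_BASED, BIO_BASED) ? "OK" : "--");
                return 0;
        }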

    o The type of a dm device is decided at the first table binding time.
    Once the type of a dm device is decided, the type can't be changed.

    o Mempool allocations are deferred until table loading time, since
    mempools for request-based dm are different from those for bio-based
    dm and the needed mempool type is fixed by the type of the table.

    o Currently, request-based dm supports only tables that have a single
    target. To support multiple targets, we need to support request
    splitting or prevent bio/request from spanning multiple targets.
    The former needs lots of changes in the block layer, and the latter
    requires that all target drivers support a merge() function.
    Both will take time.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch adds core functions for request-based dm.

    When struct mapped device (md) is initialized, md->queue has
    an I/O scheduler and the following functions are used for
    request-based dm as the queue functions:
    make_request_fn: dm_make_request()
    prep_rq_fn:      dm_prep_fn()
    request_fn:      dm_request_fn()
    softirq_done_fn: dm_softirq_done()
    lld_busy_fn:     dm_lld_busy()
    Actual initializations are done in another patch (PATCH 2).

    Below is a brief summary of how request-based dm behaves, including:
    - making request from bio
    - cloning, mapping and dispatching request
    - completing request and bio
    - suspending md
    - resuming md

    bio to request
    ==============
    md->queue->make_request_fn() (dm_make_request()) calls __make_request()
    for a bio submitted to the md.
    Then, the bio is kept in the queue as a new request or merged into
    another request in the queue if possible.

    Cloning and Mapping
    ===================
    Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
    when requests are dispatched after they are sorted by the I/O scheduler.

    dm_request_fn() checks the busy state of underlying devices using
    the target's busy() function and, if they are busy, stops dispatching
    requests so they stay on the dm device's queue.
    This helps I/O merging, since no merging is done for a request
    once it has been dispatched to underlying devices.

    Actual cloning and mapping are done in dm_prep_fn() and map_request()
    called from dm_request_fn().
    dm_prep_fn() clones not only the request but also its bios,
    so that dm can hold back bio completion in error cases and prevent
    the bio submitter from noticing the error.
    (See the "Completion" section below for details.)

    After the cloning, the clone is mapped by target's map_rq() function
    and inserted to underlying device's queue using
    blk_insert_cloned_request().

    Completion
    ==========
    Request completion can be hooked by rq->end_io(), but by then all
    bios in the request will have been completed, even in error cases,
    and the bio submitter will have noticed the error.
    To prevent the bio completion in error cases, request-based dm clones
    both bio and request and hooks both bio->bi_end_io() and rq->end_io():
    bio->bi_end_io(): end_clone_bio()
    rq->end_io(): end_clone_request()

    Summary of the request completion flow is below:

    blk_end_request() for a clone request
      => blk_update_request()
         => bio->bi_end_io() == end_clone_bio() for each clone bio
            => Free the clone bio
            => Success: Complete the original bio (blk_update_request())
               Error:   Don't complete the original bio
      => blk_finish_request()
         => rq->end_io() == end_clone_request()
            => blk_complete_request()
               => dm_softirq_done()
                  => Free the clone request
                  => Success: Complete the original request
                              (blk_end_request())
                     Error:   Requeue the original request

    end_clone_bio() completes the original request by the size of
    the original bio in the success case.
    Even if all bios in the original request are completed by that
    completion, the original request must not be completed yet, to keep
    the ordering of request completion for the stacking.
    So end_clone_bio() uses blk_update_request() instead of
    blk_end_request().
    In error cases, end_clone_bio() doesn't complete the original bio.
    It just frees the clone bio and hands the error handling over to
    end_clone_request(). (This rule is sketched below.)
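
    A minimal standalone sketch of that rule, assuming simplified
    stand-ins for the bio and request structures (the real code works
    on struct bio/struct request and calls blk_update_request()):

        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct orig_request {
                bool error_seen;
                unsigned int bytes_done;
        };

        struct clone_bio {
                int error;
                unsigned int size;
                struct orig_request *orig;
        };

        /* On success, advance the original request by this bio's size
         * only (a partial completion, preserving completion ordering);
         * on error, record it and leave the original untouched for
         * end_clone_request() to handle. */
        static void end_clone_bio(struct clone_bio *clone)
        {
                struct orig_request *orig = clone->orig;

                if (clone->error)
                        orig->error_seen = true;    /* defer to end_clone_request() */
                else if (!orig->error_seen)
                        orig->bytes_done += clone->size;    /* blk_update_request() analogue */

                free(clone);    /* the clone bio is always freed here */
        }

        int main(void)
        {
                struct orig_request orig = { false, 0 };
                struct clone_bio *c = malloc(sizeof(*c));

                if (!c)
                        return 1;
                *c = (struct clone_bio){ .error = 0, .size = 4096, .orig = &orig };
                end_clone_bio(c);
                printf("bytes done: %u, error seen: %d\n",
                       orig.bytes_done, orig.error_seen);
                return 0;
        }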

    end_clone_request(), which is called with the queue lock held,
    completes the clone request and the original request in a softirq
    context (dm_softirq_done()), where the queue lock is not held, to
    avoid a deadlock on submission of another request during the
    completion:
    - the submitted request may be mapped to the same device
    - request submission requires the queue lock, but it is already
      held by the completing context, which has no way of knowing that

    The clone request has no clone bios by the time dm_softirq_done()
    is called, so target drivers can't resubmit it, even in error cases.
    Instead, they can ask dm core to requeue and remap the original
    request in that case.

    suspend
    =======
    Request-based dm suspends the md by stopping md->queue.
    For a noflush suspend, it just stops md->queue.

    For a flush suspend, it inserts a marker request at the tail of
    md->queue and dispatches all requests in md->queue until the marker
    reaches the front of md->queue. Then it stops dispatching requests
    and waits for all dispatched requests to complete.
    After that, it completes the marker request, stops md->queue and
    wakes up the waiter on the suspend queue, md->wait.

    resume
    ======
    Starts md->queue.

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch contains a device-mapper mirror log module that forwards
    requests to userspace for processing.

    The structures used for communication between kernel and userspace are
    located in include/linux/dm-log-userspace.h. Due to the frequency,
    diversity, and 2-way communication nature of the exchanges between
    kernel and userspace, 'connector' was chosen as the interface for
    communication.

    The first log implementations written in userspace - "clustered-disk"
    and "clustered-core" - support clustered shared storage. A userspace
    daemon (in the LVM2 source code repository) uses openAIS/corosync to
    process requests in an ordered fashion with the rest of the nodes in the
    cluster so as to prevent log state corruption. Other implementations,
    with no association to LVM or openAIS/corosync, are certainly possible.

    (Imagine if two machines are writing to the same region of a mirror.
    They would both mark the region dirty, but you need a cluster-aware
    entity that can handle properly marking the region clean when they are
    done. Otherwise, you might clear the region when the first machine is
    done, not the second.)

    Signed-off-by: Jonathan Brassow
    Cc: Evgeniy Polyakov
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • Currently, device-mapper maintains a separate instance of 'struct
    queue_limits' for each table of each device. When the configuration of
    a device is to be changed, first its table is loaded and this structure
    is populated, then the device is 'resumed' and the calculated
    queue_limits are applied.

    This places restrictions on how userspace may process related devices,
    where it is often advantageous to 'load' tables for several devices
    at once before 'resuming' them together. As the new queue_limits
    only take effect after the 'resume', if they are changing and one
    device uses another, the latter must be 'resumed' before the former
    may be 'loaded'.

    This patch moves the calculation of these queue_limits out of
    the 'load' operation into 'resume'. Since we are no longer
    pre-calculating this struct, we no longer need to maintain copies
    within our dm structs.

    dm_set_device_limits() now passes the 'start' of the device's
    data area (aka pe_start) as the 'offset' to blk_stack_limits().

    init_valid_queue_limits() is replaced by blk_set_default_limits().

    Signed-off-by: Mike Snitzer
    Cc: martin.petersen@oracle.com
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • create_log_context() must use the logical_block_size from the log disk,
    where the I/O happens, not the target's logical_block_size.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
    Add .iterate_devices to 'struct target_type' to allow a function to
    be called for all devices in a DM target. It is implemented for all
    targets except those in dm-snap.c (origin and snapshot).

    (The raid1 version number jumps to 1.12 because we originally reserved
    1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors'
    instead.)
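
    A minimal standalone sketch of the hook's shape, assuming
    simplified stand-in types (the real callback also receives the
    device's start and length):

        #include <stdio.h>

        struct dm_dev_sketch {
                const char *name;
        };

        typedef int (*iterate_devices_fn)(struct dm_dev_sketch *dev, void *data);

        /* Invoke the callback once per underlying device, stopping on
         * the first non-zero return. */
        static int iterate_devices(struct dm_dev_sketch *devs, int n,
                                   iterate_devices_fn fn, void *data)
        {
                for (int i = 0; i < n; i++) {
                        int r = fn(&devs[i], data);

                        if (r)
                                return r;
                }
                return 0;
        }

        static int print_dev(struct dm_dev_sketch *dev, void *data)
        {
                (void)data;
                printf("device: %s\n", dev->name);
                return 0;
        }

        int main(void)
        {
                struct dm_dev_sketch devs[] = { { "sda" }, { "sdb" } };

                return iterate_devices(devs, 2, print_dev, NULL);
        }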

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon
    Cc: martin.petersen@oracle.com

    Mike Snitzer
     
    Copy the table's queue_limits to the DM device's request_queue. This
    properly initializes the queue's topology limits and also avoids having
    to track the evolution of 'struct queue_limits' in
    dm_table_set_restrictions().

    Also fixes a bug that was introduced in dm_table_set_restrictions() via
    commit ae03bf639a5027d27270123f5f6e3ee6a412781d. In addition to
    establishing 'bounce_pfn' in the queue's limits, blk_queue_bounce_limit()
    also performs an allocation to set up the ISA DMA pool. This allocation
    resulted in "sleeping function called from invalid context" when called
    from dm_table_set_restrictions().

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Use blk_stack_limits() to stack block limits (including topology) rather
    than duplicate the equivalent within Device Mapper.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
    Impose necessary and sufficient conditions on a device's table such
    that any incoming bio which respects its logical_block_size can be
    processed successfully.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Ensure I/O is aligned to the logical block size of target devices.

    Rename check_device_area() to device_area_is_valid() for clarity and
    establish the device limits including the logical block size prior to
    calling it.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Alasdair G Kergon

    Mike Snitzer
     
  • Add support for passing a 32 bit "cookie" into the kernel with the
    DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls. The (unsigned)
    value of this cookie is returned to userspace alongside the uevents
    issued by these ioctls in the variable DM_COOKIE.

    This means the userspace process issuing these ioctls can be notified
    by udev after udev has completed any actions triggered.

    To minimise the interface extension, we pass the cookie into the
    kernel in the event_nr field which is otherwise unused when calling
    these ioctls. Incrementing the version number allows userspace to
    determine in advance whether or not the kernel supports the cookie.
    If the kernel does support this but userspace does not, there should
    be no impact as the new variable will just get ignored.
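
    For illustration, a minimal user-space sketch of where the cookie
    travels (the struct is trimmed to the one relevant field; the value
    is arbitrary):

        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical, trimmed-down stand-in for struct dm_ioctl. */
        struct dm_ioctl_sketch {
                uint32_t event_nr;      /* otherwise unused by these ioctls */
        };

        int main(void)
        {
                struct dm_ioctl_sketch io = { 0 };
                uint32_t cookie = 0xd00d;       /* chosen by the caller */

                /* The cookie rides into the kernel in event_nr and comes
                 * back to udev as DM_COOKIE in the uevent environment. */
                io.event_nr = cookie;
                printf("event_nr carries cookie %#x\n", io.event_nr);
                return 0;
        }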

    Signed-off-by: Milan Broz
    Signed-off-by: Alasdair G Kergon

    Milan Broz
     
  • Add a file named 'suspended' to each device-mapper device directory in
    sysfs. It holds the value 1 while the device is suspended. Otherwise
    it holds 0.
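
    A minimal sketch of the attribute's show logic (the real code is a
    sysfs show function; the names here are illustrative):

        #include <stdbool.h>
        #include <stdio.h>

        static int suspended_show(bool suspended, char *buf, size_t len)
        {
                /* "1" while the device is suspended, otherwise "0". */
                return snprintf(buf, len, "%d\n", suspended ? 1 : 0);
        }

        int main(void)
        {
                char buf[4];

                suspended_show(true, buf, sizeof(buf));
                fputs(buf, stdout);
                return 0;
        }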

    Signed-off-by: Peter Rajnoha
    Signed-off-by: Alasdair G Kergon

    Peter Rajnoha
     
    Report any devices that were not freed before a table is destroyed.

    Signed-off-by: Jonathan Brassow
    Signed-off-by: Alasdair G Kergon

    Jonathan Brassow
     
  • This patch adds a service time oriented dynamic load balancer,
    dm-service-time, which selects the path with the shortest estimated
    service time for the incoming I/O.
    The service time is estimated by dividing the in-flight I/O size
    by a performance value of each path.

    The performance value can be given as a table argument at table
    load time. If no performance value is given, all paths are
    considered equal.
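
    A minimal standalone sketch of the selection rule (structures and
    numbers are invented; the cross-multiplication avoids division):

        #include <stdio.h>

        struct path {
                const char *name;
                unsigned long long in_flight_bytes;
                unsigned int perf;      /* relative throughput, from the table argument */
        };

        /* Estimated service time is in_flight_bytes / perf; pick the
         * path with the smallest estimate. a/b < c/d iff a*d < c*b. */
        static struct path *choose_path(struct path *paths, int n)
        {
                struct path *best = &paths[0];

                for (int i = 1; i < n; i++)
                        if (paths[i].in_flight_bytes * best->perf <
                            best->in_flight_bytes * paths[i].perf)
                                best = &paths[i];
                return best;
        }

        int main(void)
        {
                struct path p[] = {
                        { "sda", 1 << 20, 1 },  /* 1 MiB queued, slow path */
                        { "sdb", 3 << 20, 4 },  /* 3 MiB queued, 4x faster */
                };

                printf("selected: %s\n", choose_path(p, 2)->name);
                return 0;
        }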

    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch adds a dynamic load balancer, dm-queue-length, which
    balances the number of in-flight I/Os across the paths.

    The code is based on the patch posted by Stefan Bader:
    https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html
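
    A minimal standalone sketch of the policy (structures and numbers
    are invented):

        #include <stdio.h>

        struct path {
                const char *name;
                unsigned int in_flight; /* I/Os currently on this path */
        };

        /* Always pick the path with the fewest in-flight I/Os. */
        static struct path *choose_path(struct path *paths, int n)
        {
                struct path *best = &paths[0];

                for (int i = 1; i < n; i++)
                        if (paths[i].in_flight < best->in_flight)
                                best = &paths[i];
                return best;
        }

        int main(void)
        {
                struct path p[] = { { "sda", 7 }, { "sdb", 2 } };

                printf("selected: %s\n", choose_path(p, 2)->name);
                return 0;
        }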

    Signed-off-by: Stefan Bader
    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • This patch makes two additions to the dm path selector interface for
    dynamic load balancers:
    o a new hook, start_io()
    o a new parameter 'nr_bytes' to select_path()/start_io()/end_io()
    to pass the size of the I/O

    start_io() is called when a target driver actually submits I/O
    to the selected path.
    Path selectors can use it to start accounting of the I/O
    (e.g. counting the number of in-flight I/Os).
    The start_io hook is based on the patch posted by Stefan Bader:
    https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html

    nr_bytes, the size of the I/O, is passed so that path selectors can
    take the size of the I/O into account when deciding which path to use.
    dm-service-time uses it to estimate service time, for example.
    (The nr_bytes member was added to dm_mpath_io instead of using the
    existing details.bi_size, since the request-based dm patch deletes it.)
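
    A minimal standalone sketch of the extended interface and the
    accounting it enables (the signatures are simplified stand-ins, not
    the real dm path-selector prototypes):

        #include <stddef.h>
        #include <stdio.h>

        struct dm_path;

        struct path_selector_sketch {
                struct dm_path *(*select_path)(void *ctx, size_t nr_bytes);
                void (*start_io)(void *ctx, struct dm_path *path, size_t nr_bytes);
                void (*end_io)(void *ctx, struct dm_path *path, size_t nr_bytes);
        };

        struct ctx_sketch {
                size_t in_flight_bytes; /* what a selector can now account */
        };

        static void start_io_sketch(void *ctx, struct dm_path *p, size_t nr_bytes)
        {
                (void)p;
                ((struct ctx_sketch *)ctx)->in_flight_bytes += nr_bytes;
        }

        static void end_io_sketch(void *ctx, struct dm_path *p, size_t nr_bytes)
        {
                (void)p;
                ((struct ctx_sketch *)ctx)->in_flight_bytes -= nr_bytes;
        }

        int main(void)
        {
                struct ctx_sketch c = { 0 };

                start_io_sketch(&c, NULL, 4096);
                printf("in flight: %zu bytes\n", c.in_flight_bytes);
                end_io_sketch(&c, NULL, 4096);
                return 0;
        }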

    Signed-off-by: Stefan Bader
    Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Alasdair G Kergon

    Kiyoshi Ueda
     
  • Send barrier requests when updating the exception area.

    Exception area updates need to be ordered with respect to data writes,
    so that the writes are not reordered by the hardware disk cache.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
    If -EOPNOTSUPP was returned and the request was a barrier request,
    retry it without the barrier.

    Retry all regions for now. Barriers are submitted only for one-region requests,
    so it doesn't matter. (In the future, retries can be limited to the actual
    regions that failed.)
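
    A minimal standalone sketch of the retry rule (the flag value is a
    stand-in, not the real bio flag):

        #include <stdio.h>

        #define EOPNOTSUPP      95              /* Linux errno value */
        #define BARRIER_FLAG    (1u << 0)       /* stand-in flag value */

        /* A barrier request that failed with -EOPNOTSUPP is resubmitted
         * (all regions, for now) with the barrier flag stripped. */
        static unsigned int flags_for_retry(int error, unsigned int flags)
        {
                if (error == -EOPNOTSUPP && (flags & BARRIER_FLAG))
                        return flags & ~BARRIER_FLAG;
                return flags;
        }

        int main(void)
        {
                printf("retry flags: %#x\n",
                       flags_for_retry(-EOPNOTSUPP, BARRIER_FLAG));
                return 0;
        }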

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
    Add another field, eopnotsupp_bits. It is a subset of error_bits,
    representing regions that returned -EOPNOTSUPP. (Each such bit is set
    in both error_bits and eopnotsupp_bits.)

    This value will be used in further patches.
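
    A minimal standalone sketch of the bookkeeping (the region number
    and error value are illustrative):

        #include <stdio.h>

        #define EOPNOTSUPP 95   /* Linux errno value */

        int main(void)
        {
                unsigned long error_bits = 0, eopnotsupp_bits = 0;
                int region = 3, error = -EOPNOTSUPP;

                /* An -EOPNOTSUPP failure sets the region's bit in both
                 * masks, so eopnotsupp_bits stays a subset of error_bits. */
                if (error) {
                        error_bits |= 1UL << region;
                        if (error == -EOPNOTSUPP)
                                eopnotsupp_bits |= 1UL << region;
                }

                printf("error: %#lx, eopnotsupp: %#lx\n",
                       error_bits, eopnotsupp_bits);
                return 0;
        }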

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Flush support for dm-snapshot target.

    This patch just forwards the flush request to either the origin or the snapshot
    device. (It doesn't flush exception store metadata.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka
     
  • Flush support for dm-multipath target.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Alasdair G Kergon

    Mikulas Patocka