09 Sep, 2009

24 commits

  • Channel switching is problematic for some dmaengine drivers as the
    architecture precludes separating the ->prep from ->submit. In these
    cases the driver can select ASYNC_TX_DISABLE_CHANNEL_SWITCH to modify
    the async_tx allocator to only return channels that support all of the
    required asynchronous operations.

    For example MD_RAID456=y selects support for asynchronous xor, xor
    validate, pq, pq validate, and memcpy. When
    ASYNC_TX_DISABLE_CHANNEL_SWITCH=y any channel with all these
    capabilities is marked DMA_ASYNC_TX allowing async_tx_find_channel() to
    quickly locate compatible channels with the guarantee that dependency
    chains will remain on one channel. When
    ASYNC_TX_DISABLE_CHANNEL_SWITCH=n async_tx_find_channel() may select
    channels that lead to operation chains that need to cross channel
    boundaries using the async_tx channel switch capability.

    Signed-off-by: Dan Williams

    Dan Williams
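    A minimal sketch, assuming the 2009-era dmaengine capability helpers, of
    how a client could test for the combined capability bit described above;
    the helper name is illustrative, not part of the patch:

      #include <linux/dmaengine.h>

      /* With ASYNC_TX_DISABLE_CHANNEL_SWITCH=y a channel that supports the
       * whole required operation set is tagged DMA_ASYNC_TX, so one
       * capability test suffices. */
      static bool chan_suits_raid456(struct dma_chan *chan)
      {
              return dma_has_cap(DMA_ASYNC_TX, chan->device->cap_mask);
      }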
     
  • Some engines optimize operation by reading ahead in the descriptor chain
    such that descriptor2 may start execution before descriptor1 completes.
    If descriptor2 depends on the result from descriptor1 then a fence is
    required (on descriptor2) to disable this optimization. The async_tx
    api could implicitly identify dependencies via the 'depend_tx'
    parameter, but that would constrain cases where the dependency chain
    only specifies a completion order rather than a data dependency. So,
    provide an ASYNC_TX_FENCE to explicitly identify data dependencies.

    Signed-off-by: Dan Williams

    Dan Williams
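    A hedged sketch of how a caller flags such a data dependency, assuming
    the async_submit_ctl submission interface introduced in this series (the
    wrapper function and its parameters are illustrative):

      #include <linux/async_tx.h>

      /* 'dest' is produced by the xor and then consumed by the copy, so the
       * copy carries ASYNC_TX_FENCE to disable engine read-ahead. */
      static struct dma_async_tx_descriptor *
      fenced_xor_then_copy(struct page *dest, struct page **srcs, int src_cnt,
                           struct page *copy_dest, size_t len,
                           addr_conv_t *scribble)
      {
              struct async_submit_ctl submit;
              struct dma_async_tx_descriptor *tx;

              init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, NULL, NULL,
                                NULL, scribble);
              tx = async_xor(dest, srcs, 0, src_cnt, len, &submit);

              /* real data dependency: fence the copy against the xor */
              init_async_submit(&submit, ASYNC_TX_FENCE, tx, NULL, NULL,
                                scribble);
              return async_memcpy(copy_dest, dest, 0, 0, len, &submit);
      }

    A dependency that only orders completion, by contrast, would pass the
    depend_tx without ASYNC_TX_FENCE.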
     
  • Merge commit; conflicts resolved in:
    include/linux/dmaengine.h

    Dan Williams
     
  • Handle descriptor allocation failures by polling for a descriptor. The
    driver will force forward progress when polled. In the best case this
    polling interval will be the time it takes for one dma memcpy
    transaction to complete. In the worst case, channel hang, we will need
    to wait 100ms for the cleanup watchdog to fire (ioatdma driver).

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Increment the allocation order of the descriptor ring every time we run
    out of descriptors, up to the maximum allocation order specified by the
    module parameter 'ioat_max_alloc_order'. After each idle period
    decrement the allocation order to a minimum order of
    'ioat_ring_alloc_order' (i.e. the default ring size, tunable as a module
    parameter).

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
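    A minimal sketch of the grow/shrink policy described above; the helper is
    illustrative and the actual ring reallocation is left to the driver:

      #include <linux/types.h>

      /* 'order' is the current allocation order; the ring holds
       * 1 << order descriptors. */
      static u16 next_ring_order(u16 order, bool exhausted, bool idle,
                                 u16 min_order, u16 max_order)
      {
              if (exhausted && order < max_order)
                      return order + 1;  /* grow when descriptors run out */
              if (idle && order > min_order)
                      return order - 1;  /* shrink back after an idle period */
              return order;
      }

    Here min_order stands in for ioat_ring_alloc_order and max_order for
    ioat_max_alloc_order.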
     
  • In order to support dynamic resizing of the descriptor ring or polling
    for a descriptor in the presence of a hung channel the reset handler
    needs to make progress while in a non-preemptible context. The current
    workqueue implementation precludes polling channel reset completion
    under spin_lock().

    This conversion also allows us to return to opportunistic cleanup in the
    ioat2 case as the timer implementation guarantees at least one cleanup
    after every descriptor is submitted. This means the worst case
    completion latency becomes the timer frequency (for exceptional
    circumstances), but with the benefit of avoiding busy waiting when the
    lock is contended.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Save 4 bytes per software descriptor by transmitting tx_cnt in an unused
    portion of the hardware descriptor.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Mark all single use initialization routines with __devinit.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The register write in ioat_dma_cleanup_tasklet is unfortunate in two
    ways:
    1/ It clears the extra 'enable' bits that we set at alloc_chan_resources time
    2/ It gives the impression that it disables interrupts when it is in
    fact re-arming interrupts

    [ Impact: fix, persist the value of the chanctrl register when re-arming ]

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Don't trust that the reserved bits are always zero; also sanity check
    the returned value.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The cleanup path makes an effort to only perform an atomic read of the
    64-bit completion address. However in the 32-bit case it does not
    matter if we read the upper-32 and lower-32 non-atomically because the
    upper-32 will always be zero.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Provide some output for debugging the driver.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The unified ioat1/ioat2 ioat_dma_unmap() implementation derives the
    source and dest addresses from the unmap descriptor. There is no longer
    a need to track this information in struct ioat_desc_sw.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the current linked list, munged into a ring, with a native
    ring-buffer implementation. The benefit of this approach is reduced
    overhead: many parameters can be derived from ring position with simple
    pointer comparisons, and descriptor allocation/freeing becomes just a
    manipulation of head/tail pointers.

    It requires a contiguous allocation for the software descriptor
    information.

    Since this arrangement is significantly different from the ioat1 chain,
    move ioat2,3 support into its own file and header. Common routines are
    exported from driver/dma/ioat/dma.[ch].

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
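    A hedged sketch of the ring bookkeeping this enables; field and helper
    names are abridged stand-ins for the ioat2 definitions. head and tail are
    free-running 16-bit counters and the ring size is a power of two:

      #include <linux/types.h>

      struct ring_sketch {
              u16 head;          /* next descriptor slot to allocate */
              u16 tail;          /* next descriptor slot to clean up */
              u16 alloc_order;   /* ring holds 1 << alloc_order entries */
      };

      static inline unsigned int ring_size(const struct ring_sketch *r)
      {
              return 1 << r->alloc_order;
      }

      static inline unsigned int ring_active(const struct ring_sketch *r)
      {
              /* descriptors submitted but not yet cleaned up */
              return (r->head - r->tail) & (ring_size(r) - 1);
      }

      static inline unsigned int ring_space(const struct ring_sketch *r)
      {
              /* keep one slot free so full and empty are distinguishable */
              return ring_size(r) - 1 - ring_active(r);
      }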
     
  • Prepare the code for the conversion of the ioat2 linked-list-ring into a
    native ring buffer. After this conversion ioat2 channels will share
    less of the ioat1 infrastructure, but there will still be places where
    sharing is possible. struct ioat_chan_common is created to house the
    channel attributes that will remain common between ioat1 and ioat2
    channels.

    For every routine that accesses both common and hardware-specific fields,
    the old unified 'ioat_chan' pointer is split into an 'ioat' and a 'chan'
    pointer, where 'chan' references the common fields and 'ioat' the
    hardware/version-specific ones.

    [ Impact: pure structure member movement/variable renames, no logic changes ]

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
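    A hedged sketch of the resulting layout; the fields shown are abridged
    and illustrative rather than the full driver definitions:

      #include <linux/dmaengine.h>

      struct ioat_chan_common {           /* state shared by ioat1 and ioat2,3 */
              struct dma_chan common;
              void __iomem *reg_base;
              unsigned long state;
      };

      struct ioat2_dma_chan {             /* ring-specific state */
              struct ioat_chan_common base;
              u16 head;
              u16 tail;
              u16 alloc_order;
      };

    A routine that touches both would then take a 'struct ioat2_dma_chan
    *ioat' and derive 'struct ioat_chan_common *chan = &ioat->base'.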
     
  • If a callback is to be attached to a descriptor, the channel needs to
    know at ->prep time so it can set the interrupt enable bit. This is in
    preparation for moving ioat2 descriptor preparation from
    ->submit to ->prep.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The async_tx api assumes that after a successful ->prep a subsequent
    ->submit will not fail due to a lack of resources.

    This also fixes a bug in the allocation failure case. Previously the
    descriptors allocated prior to the allocation failure would not be
    returned to the free list.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • This cleans up a mess of and'ing and or'ing bit definitions, and allows
    simple assignments from the specified dma_ctrl_flags parameter.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • ->dmacount tracks the sequence number of active descriptors. It is
    written to the DMACOUNT register to update the channel's view of pending
    descriptors in the chain. The register is 16-bits so ->dmacount should
    be unsigned and 16-bit as well. Also modify ->desccount to maintain
    alignment.

    This was never a problem in practice because we never compared dmacount
    values, but this is a bug waiting to happen.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
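    A small sketch of the point above; names are abridged:

      #include <linux/types.h>

      struct ioat_count_sketch {
              u16 dmacount;    /* sequence number last written to DMACOUNT */
              u16 desccount;   /* kept 16-bit to match and preserve alignment */
      };

      static u16 descs_pending(const struct ioat_count_sketch *c,
                               u16 hw_completed)
      {
              /* 16-bit subtraction wraps exactly like the register does */
              return c->dmacount - hw_completed;
      }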
     
  • Towards the removal of ioatdma_device.version split the initialization
    path into distinct versions. This conversion:
    1/ moves version specific probe code to version specific routines
    2/ removes the need for ioat_device
    3/ turns off the ioat1 msi quirk if the device is reinitialized for intx

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The only .c files that utilize these protected prototypes depend on
    CONFIG_INTEL_IOATDMA=y, so there is no value gained in providing empty
    prototypes.

    [ Impact: pure cleanup ]

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Assorted renames and cleanups:
    * reduce device->common. to dma-> in ioat_dma_{probe,remove,selftest}
    * ioat_lookup_chan_by_index to ioat_chan_by_index
    * multi-line function definitions
    * ioat_desc_sw.async_tx to ioat_desc_sw.txd
    * desc->txd. to tx-> in cleanup routine

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The driver currently duplicates much of what these routines offer, so
    just use the common code. For example ->irq_mode tracks what interrupt
    mode was initialized, which duplicates the ->msix_enabled and
    ->msi_enabled handling in pcim_release.

    This also adds a check to the return value of dma_async_device_register,
    which can fail.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
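    A hedged sketch of a probe path built on the managed (devres) helpers
    referred to above; the BAR mask and resource name are illustrative:

      #include <linux/init.h>
      #include <linux/pci.h>

      static int __devinit sketch_probe(struct pci_dev *pdev,
                                        const struct pci_device_id *id)
      {
              int err;

              err = pcim_enable_device(pdev);
              if (err)
                      return err;

              err = pcim_iomap_regions(pdev, 1 << 0, "ioat-sketch");
              if (err)
                      return err;

              /* no explicit teardown: devres releases the regions and any
               * msi/msix state when the device is unbound */
              return 0;
      }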
     
  • Some of these defines may be useful outside of dma.c and the header is
    private so there are no namespace pollution concerns.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Aug, 2009

16 commits

  • Now that the resources to handle stripe_head operations are allocated
    percpu it is possible for raid5d to distribute stripe handling over
    multiple cores. This conversion also adds a call to cond_resched() in
    the non-multicore case to prevent one core from getting monopolized for
    raid operations.

    Cc: Arjan van de Ven
    Signed-off-by: Dan Williams

    Dan Williams
     
  • These routines have been replaced by their asynchronous counterparts.

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • 1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to
    raid_run_ops
    2/ Implement a handler for sh->reconstruct_state similar to the raid5 case
    (adds handling of Q parity)
    3/ Prevent handle_parity_checks6 from running concurrently with 'compute'
    operations
    4/ Hook up raid_run_ops

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • [ Based on an original patch by Yuri Tikhonov ]

    Implement the state machine for handling the RAID-6 parities check and
    repair functionality. Note that, unlike raid5, the raid6 case does not
    need to check for new failures, as it always writes back the correct
    disks. The raid5 case can be updated to check zero_sum_result to avoid
    getting confused by new failures rather than retrying the entire check
    operation.

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In the synchronous implementation of stripe dirtying we processed a
    degraded stripe with one call to handle_stripe_dirtying6(). I.e.
    compute the missing blocks from the other drives, then copy in the new
    data and reconstruct the parities.

    In the asynchronous case we do not perform stripe operations directly.
    Instead, operations are scheduled with flags to be later serviced by
    raid_run_ops. So, for the degraded case the final reconstruction step
    can only be carried out after all blocks have been brought up to date by
    being read or computed. As in the raid5 case, schedule_reconstruction()
    sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and
    through operation chaining can handle compute and reconstruct in a
    single raid_run_ops pass.

    [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating]
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • Modify handle_stripe_fill6 to work asynchronously by introducing
    fetch_block6 as the raid6 analog of fetch_block5 (schedule compute
    operations for missing/out-of-sync disks).

    [dan.j.williams@intel.com: compute D+Q in one pass]
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • Extend schedule_reconstruction5 for reuse by the raid6 path. Add
    support for generating Q and BUG() if a request is made to perform
    'prexor'.

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • [ Based on an original patch by Yuri Tikhonov ]

    The raid_run_ops routine uses the asynchronous offload api and
    the stripe_operations member of a stripe_head to carry out xor+pq+copy
    operations asynchronously, outside the lock.

    The operations performed by RAID-6 are the same as in the RAID-5 case,
    except that STRIPE_OP_PREXOR operations are not supported. All the others
    are supported:
    STRIPE_OP_BIOFILL     - copy data into request buffers to satisfy a read request
    STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks
    STRIPE_OP_BIODRAIN    - copy data out of request buffers to satisfy a write request
    STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache
    STRIPE_OP_CHECK       - verify that the parity is correct

    The flow is the same as in the RAID-5 case, and reuses some routines, namely:
    1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
    2/ ops_complete_compute (updated to set up to 2 targets uptodate)
    3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

    [neilb@suse.de: fixes to get it to pass mdadm regression suite]
    Reviewed-by: Andre Noll
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Dan Williams
     
  • ops_complete_compute5 can be reused in the raid6 path if it is updated to
    generically handle a second target.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Port drivers/md/raid6test/test.c to use the async raid6 recovery
    routines. This is meant as a unit test for raid6 acceleration drivers. In
    addition to the 16-drive test case this implements tests for the 4-disk and
    5-disk special cases (dma devices cannot generically handle fewer than 2
    sources), and adds a test for the D+Q case.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Test raid6 p+q operations with a simple "always multiply by 1" q
    calculation to fit into dmatest's current destination verification
    scheme.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • async_raid6_2data_recov() recovers two data disk failures

    async_raid6_datap_recov() recovers a data disk and the P disk

    These routines are a port of the synchronous versions found in
    drivers/md/raid6recov.c. The primary difference is breaking out the xor
    operations into separate calls to async_xor. Two helper routines are
    introduced to perform scalar multiplication where needed.
    async_sum_product() multiplies two sources by scalar coefficients and
    then sums (xor) the result. async_mult() simply multiplies a single
    source by a scalar.

    This implementation also includes, in contrast to the original
    synchronous-only code, special case handling for the 4-disk and 5-disk
    array cases. In these situations the default N-disk algorithm will
    present 0-source or 1-source operations to dma devices. To cover for
    dma devices where the minimum source count is 2 we implement 4-disk and
    5-disk handling in the recovery code.

    [ Impact: asynchronous raid6 recovery routines for 2data and datap cases ]

    Cc: Yuri Tikhonov
    Cc: Ilya Yanok
    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
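    A hedged sketch of invoking the new recovery routines; blocks[] holds the
    stripe's pages in disk order (data disks, then P, then Q), and the
    wrapper itself is illustrative:

      #include <linux/async_tx.h>

      static struct dma_async_tx_descriptor *
      recover_two_data(int disks, size_t bytes, int faila, int failb,
                       struct page **blocks, addr_conv_t *scribble)
      {
              struct async_submit_ctl submit;

              init_async_submit(&submit, 0, NULL, NULL, NULL, scribble);
              return async_raid6_2data_recov(disks, bytes, faila, failb,
                                             blocks, &submit);
      }

    The data+P case is analogous, calling async_raid6_datap_recov() with a
    single failed index.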
     
  • [ Based on an original patch by Yuri Tikhonov ]

    This adds support for doing asynchronous GF multiplication by adding
    two additional functions to the async_tx API:

    async_gen_syndrome() does simultaneous XOR and Galois field
    multiplication of sources.

    async_syndrome_val() validates the given source buffers against known P
    and Q values.

    When a request is made to run async_pq against more than the hardware
    maximum number of supported sources we need to reuse the previous
    generated P and Q values as sources into the next operation. Care must
    be taken to remove Q from P' and P from Q'. For example to perform a 5
    source pq op with hardware that only supports 4 sources at a time the
    following approach is taken:

    p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
    p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

    p' = p + q + q + src4 = p + src4
    q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4

    Note: 4 is the minimum acceptable maxpq; otherwise we punt to the
    synchronous software path.

    The DMA_PREP_CONTINUE flag indicates to the driver to reuse p and q as
    sources (in the above manner) and fill the remaining slots up to maxpq
    with the new sources/coefficients.

    Note1: Some devices have native support for P+Q continuation and can skip
    this extra work. Devices with this capability can advertise it with
    dma_set_maxpq. It is up to each driver how to handle the
    DMA_PREP_CONTINUE flag.

    Note2: The api supports disabling the generation of P when generating Q;
    this is ignored by the synchronous path but is implemented by some dma
    devices to save unnecessary writes. In this case the continuation
    algorithm is simplified to only reuse Q as a source.

    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
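    A hedged sketch of generating and then validating P/Q with the new api;
    blocks[] holds disks-2 data pages followed by the P and Q pages, and
    'spare' is a scratch page used by the validate path (the wrapper is
    illustrative):

      #include <linux/async_tx.h>

      static struct dma_async_tx_descriptor *
      gen_then_check_pq(struct page **blocks, int disks, size_t len,
                        struct page *spare, enum sum_check_flags *result,
                        addr_conv_t *scribble)
      {
              struct async_submit_ctl submit;
              struct dma_async_tx_descriptor *tx;

              init_async_submit(&submit, 0, NULL, NULL, NULL, scribble);
              tx = async_gen_syndrome(blocks, 0, disks, len, &submit);

              /* the check consumes the P/Q just generated, so fence it */
              init_async_submit(&submit, ASYNC_TX_FENCE, tx, NULL, NULL,
                                scribble);
              return async_syndrome_val(blocks, 0, disks, len, result,
                                        spare, &submit);
      }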
     
  • We currently walk the parent chain when waiting for a given tx to
    complete; however, this walk may race with the driver cleanup routine.
    The routines in async_raid6_recov.c may fall back to the synchronous
    path at any point so we need to be prepared to call async_tx_quiesce()
    (which calls dma_wait_for_async_tx). To remove the ->parent walk we
    guarantee that every time a dependency is attached ->issue_pending() is
    invoked, then we can simply poll the initial descriptor until
    completion.

    This also allows for a lighter weight 'issue pending' implementation as
    there is no longer a requirement to iterate through all the channels'
    ->issue_pending() routines as long as operations have been submitted in
    an ordered chain. async_tx_issue_pending() is added for this case.

    Signed-off-by: Dan Williams

    Dan Williams
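    A minimal sketch of the simplified wait this enables, using the existing
    dma_sync_wait() helper to poll the descriptor's own cookie:

      #include <linux/dmaengine.h>

      static enum dma_status wait_on_tx(struct dma_async_tx_descriptor *tx)
      {
              /* every dependency has already had ->issue_pending() invoked,
               * so polling this descriptor alone is sufficient */
              return dma_sync_wait(tx->chan, tx->cookie);
      }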
     
  • If module_init and module_exit are nops then neither need to be defined.

    [ Impact: pure cleanup ]

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the flat zero_sum_result with a collection of flags to contain
    the P (xor) zero-sum result and the soon-to-be-utilized Q (raid6
    Reed-Solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead
    of DMA_ since these flags will be used on non-dma-zero-sum enabled
    platforms.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
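    A small sketch of consuming the new flags in a completion path; the
    function is illustrative:

      #include <linux/async_tx.h>

      static void report_zero_sum(enum sum_check_flags result)
      {
              if (result & SUM_CHECK_P_RESULT)
                      pr_debug("P (xor) zero-sum check failed\n");
              if (result & SUM_CHECK_Q_RESULT)
                      pr_debug("Q (reed-solomon) zero-sum check failed\n");
      }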