04 Jul, 2013

1 commit

  • There have never been any real users of MEMSET operations since they
    have been introduced in January 2007 by commit 7405f74badf4 ("dmaengine:
    refactor dmaengine around dma_async_tx_descriptor"). Therefore remove
    support for them for now, it can be always brought back when needed.

    [sebastian.hesselbarth@gmail.com: fix drivers/dma/mv_xor]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Sebastian Hesselbarth
    Cc: Vinod Koul
    Acked-by: Dan Williams
    Cc: Tomasz Figa
    Cc: Herbert Xu
    Cc: Olof Johansson
    Cc: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

30 Aug, 2009

3 commits

  • Port drivers/md/raid6test/test.c to use the async raid6 recovery
    routines. This is meant as a unit test for raid6 acceleration drivers. In
    addition to the 16-drive test case this implements tests for the 4-disk and
    5-disk special cases (dma devices can not generically handle less than 2
    sources), and adds a test for the D+Q case.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • async_raid6_2data_recov() recovers two data disk failures

    async_raid6_datap_recov() recovers a data disk and the P disk

    These routines are a port of the synchronous versions found in
    drivers/md/raid6recov.c. The primary difference is breaking out the xor
    operations into separate calls to async_xor. Two helper routines are
    introduced to perform scalar multiplication where needed.
    async_sum_product() multiplies two sources by scalar coefficients and
    then sums (xor) the result. async_mult() simply multiplies a single
    source by a scalar.

    This implemention also includes, in contrast to the original
    synchronous-only code, special case handling for the 4-disk and 5-disk
    array cases. In these situations the default N-disk algorithm will
    present 0-source or 1-source operations to dma devices. To cover for
    dma devices where the minimum source count is 2 we implement 4-disk and
    5-disk handling in the recovery code.

    [ Impact: asynchronous raid6 recovery routines for 2data and datap cases ]

    Cc: Yuri Tikhonov
    Cc: Ilya Yanok
    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • [ Based on an original patch by Yuri Tikhonov ]

    This adds support for doing asynchronous GF multiplication by adding
    two additional functions to the async_tx API:

    async_gen_syndrome() does simultaneous XOR and Galois field
    multiplication of sources.

    async_syndrome_val() validates the given source buffers against known P
    and Q values.

    When a request is made to run async_pq against more than the hardware
    maximum number of supported sources we need to reuse the previous
    generated P and Q values as sources into the next operation. Care must
    be taken to remove Q from P' and P from Q'. For example to perform a 5
    source pq op with hardware that only supports 4 sources at a time the
    following approach is taken:

    p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
    p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

    p' = p + q + q + src4 = p + src4
    q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4

    Note: 4 is the minimum acceptable maxpq otherwise we punt to
    synchronous-software path.

    The DMA_PREP_CONTINUE flag indicates to the driver to reuse p and q as
    sources (in the above manner) and fill the remaining slots up to maxpq
    with the new sources/coefficients.

    Note1: Some devices have native support for P+Q continuation and can skip
    this extra work. Devices with this capability can advertise it with
    dma_set_maxpq. It is up to each driver how to handle the
    DMA_PREP_CONTINUE flag.

    Note2: The api supports disabling the generation of P when generating Q,
    this is ignored by the synchronous path but is implemented by some dma
    devices to save unnecessary writes. In this case the continuation
    algorithm is simplified to only reuse Q as a source.

    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     

13 Jul, 2007

1 commit

  • The async_tx api provides methods for describing a chain of asynchronous
    bulk memory transfers/transforms with support for inter-transactional
    dependencies. It is implemented as a dmaengine client that smooths over
    the details of different hardware offload engine implementations. Code
    that is written to the api can optimize for asynchronous operation and the
    api will fit the chain of operations to the available offload resources.

    I imagine that any piece of ADMA hardware would register with the
    'async_*' subsystem, and a call to async_X would be routed as
    appropriate, or be run in-line. - Neil Brown

    async_tx exploits the capabilities of struct dma_async_tx_descriptor to
    provide an api of the following general format:

    struct dma_async_tx_descriptor *
    async_(..., struct dma_async_tx_descriptor *depend_tx,
    dma_async_tx_callback cb_fn, void *cb_param)
    {
    struct dma_chan *chan = async_tx_find_channel(depend_tx, );
    struct dma_device *device = chan ? chan->device : NULL;
    int int_en = cb_fn ? 1 : 0;
    struct dma_async_tx_descriptor *tx = device ?
    device->device_prep_dma_(chan, len, int_en) : NULL;

    if (tx) { /* run asynchronously */
    ...
    tx->tx_set_dest(addr, tx, index);
    ...
    tx->tx_set_src(addr, tx, index);
    ...
    async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
    } else { /* run synchronously */
    ...

    ...
    async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
    }

    return tx;
    }

    async_tx_find_channel() returns a capable channel from its pool. The
    channel pool is organized as a per-cpu array of channel pointers. The
    async_tx_rebalance() routine is tasked with managing these arrays. In the
    uniprocessor case async_tx_rebalance() tries to spread responsibility
    evenly over channels of similar capabilities. For example if there are two
    copy+xor channels, one will handle copy operations and the other will
    handle xor. In the SMP case async_tx_rebalance() attempts to spread the
    operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
    channel0 while cpu1 gets copy channel 1 and xor channel 1. When a
    dependency is specified async_tx_find_channel defaults to keeping the
    operation on the same channel. A xor->copy->xor chain will stay on one
    channel if it supports both operation types, otherwise the transaction will
    transition between a copy and a xor resource.

    Currently the raid5 implementation in the MD raid456 driver has been
    converted to the async_tx api. A driver for the offload engines on the
    Intel Xscale series of I/O processors, iop-adma, is provided in a later
    commit. With the iop-adma driver and async_tx, raid456 is able to offload
    copy, xor, and xor-zero-sum operations to hardware engines.

    On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
    improvement) and sequential reads to a degraded array (40 - 55%
    improvement). For the other cases performance was roughly equal, +/- a few
    percentage points. On a x86-smp platform the performance of the async_tx
    implementation (in synchronous mode) was also +/- a few percentage points
    of the original implementation. According to 'top' on iop342 CPU
    utilization drops from ~50% to ~15% during a 'resync' while the speed
    according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.

    The tiobench command line used for testing was: tiobench --size 2048
    --block 4096 --block 131072 --dir /mnt/raid --numruns 5
    * iop342 had 1GB of memory available

    Details:
    * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
    async_tx_find_channel a static inline routine that always returns NULL
    * when a callback is specified for a given transaction an interrupt will
    fire at operation completion time and the callback will occur in a
    tasklet. if the the channel does not support interrupts then a live
    polling wait will be performed
    * the api is written as a dmaengine client that requests all available
    channels
    * In support of dependencies the api implicitly schedules channel-switch
    interrupts. The interrupt triggers the cleanup tasklet which causes
    pending operations to be scheduled on the next channel
    * Xor engines treat an xor destination address differently than a software
    xor routine. To the software routine the destination address is an implied
    source, whereas engines treat it as a write-only destination. This patch
    modifies the xor_blocks routine to take a an explicit destination address
    to mirror the hardware.

    Changelog:
    * fixed a leftover debug print
    * don't allow callbacks in async_interrupt_cond
    * fixed xor_block changes
    * fixed usage of ASYNC_TX_XOR_DROP_DEST
    * drop dma mapping methods, suggested by Chris Leech
    * printk warning fixups from Andrew Morton
    * don't use inline in C files, Adrian Bunk
    * select the API when MD is enabled
    * BUG_ON xor source counts
    Signed-off-by: Dan Williams
    Acked-By: NeilBrown

    Dan Williams