22 Apr, 2015

1 commit

  • Glue it all together. The raid6 rmw path should work the same as the
    already existing raid5 logic. So emulate the prexor handling/flags
    and split functions as needed.

    1) Enable xor_syndrome() in the async layer.

    2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
    at the start of a rmw run, as was already done for the single parity.

    3) Take care of rmw run in ops_run_reconstruct6(). Again process only
    the changed pages to get syndrome back into sync.

    4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
    run. The lower layers will calculate start & end pages from that and
    call the xor_syndrome() correspondingly.

    5) Adapt the several places where we ignored Q handling up to now.
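
    The rmw idea in steps 2) to 4) can be illustrated in userspace. The
    following is a minimal sketch, not the kernel code: it assumes the usual
    RAID6 field GF(2^8) with polynomial 0x11d, and the names gf_mul2(),
    gen_syndrome_full(), xor_syndrome_one() and rmw_matches_full() are
    hypothetical. It handles one changed block at a time, whereas the commit
    describes a start/end page range. Xoring the old data out of P and Q and
    the new data in should give the same result as a full recompute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply by {02} in GF(2^8), reduction polynomial 0x11d (assumed). */
static uint8_t gf_mul2(uint8_t a)
{
	return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

/* Multiply a by {02}^exp; the disk index selects the Q coefficient. */
static uint8_t gf_mul_pow2(uint8_t a, size_t exp)
{
	while (exp--)
		a = gf_mul2(a);
	return a;
}

/* Full recompute: P = xor of all data, Q = sum of {02}^d * data[d]. */
void gen_syndrome_full(size_t disks, size_t len, uint8_t *const *data,
		       uint8_t *p, uint8_t *q)
{
	assert(disks > 0);
	for (size_t b = 0; b < len; b++) {
		uint8_t pv = 0, qv = 0;

		for (size_t d = 0; d < disks; d++) {
			pv ^= data[d][b];
			qv ^= gf_mul_pow2(data[d][b], d);
		}
		p[b] = pv;
		q[b] = qv;
	}
}

/*
 * rmw step for a single block at index idx (hypothetical helper).
 * Addition is xor, so the same call removes old data or adds new data.
 */
void xor_syndrome_one(size_t idx, size_t len, const uint8_t *blk,
		      uint8_t *p, uint8_t *q)
{
	for (size_t b = 0; b < len; b++) {
		p[b] ^= blk[b];
		q[b] ^= gf_mul_pow2(blk[b], idx);
	}
}

/* Returns 1 when the rmw result equals a full recompute, else 0. */
int rmw_matches_full(void)
{
	enum { DISKS = 4, LEN = 8, CHANGED = 2 };
	uint8_t d[DISKS][LEN], newblk[LEN];
	uint8_t p[LEN], q[LEN], p_ref[LEN], q_ref[LEN];
	uint8_t *ptrs[DISKS];

	for (size_t i = 0; i < DISKS; i++) {
		ptrs[i] = d[i];
		for (size_t b = 0; b < LEN; b++)
			d[i][b] = (uint8_t)(17 * i + 31 * b + 1);
	}
	for (size_t b = 0; b < LEN; b++)
		newblk[b] = (uint8_t)(0xa5 ^ (7 * b));

	gen_syndrome_full(DISKS, LEN, ptrs, p, q);

	/* prexor: take the old contents of the changed block out of P/Q */
	xor_syndrome_one(CHANGED, LEN, d[CHANGED], p, q);
	memcpy(d[CHANGED], newblk, LEN);
	/* reconstruct: fold the new contents back in */
	xor_syndrome_one(CHANGED, LEN, newblk, p, q);

	gen_syndrome_full(DISKS, LEN, ptrs, p_ref, q_ref);
	return !memcmp(p, p_ref, LEN) && !memcmp(q, q_ref, LEN);
}
```

    Only the changed block is read and processed in the rmw path, which is
    why it can pay off for small writes on wide arrays.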

    Performance numbers for a single E5630 system with a mix of 10 7200 rpm
    desktop/server disks. 300 seconds of random writes with 8 threads onto a
    3.2TB (10*400GB) RAID6, 64K chunk, without spare (group_thread_cnt=4):

    bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
            skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
       4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
       8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
      16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
      32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
      64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
     128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
     256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
     512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s

    Signed-off-by: Markus Stockhausen
    Signed-off-by: NeilBrown

    Markus Stockhausen
     

04 Jul, 2013

1 commit

  • There have never been any real users of MEMSET operations since they
    were introduced in January 2007 by commit 7405f74badf4 ("dmaengine:
    refactor dmaengine around dma_async_tx_descriptor"). Therefore remove
    support for them for now; it can always be brought back when needed.

    [sebastian.hesselbarth@gmail.com: fix drivers/dma/mv_xor]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Signed-off-by: Sebastian Hesselbarth
    Cc: Vinod Koul
    Acked-by: Dan Williams
    Cc: Tomasz Figa
    Cc: Herbert Xu
    Cc: Olof Johansson
    Cc: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

09 Sep, 2009

1 commit

  • Some engines optimize operation by reading ahead in the descriptor chain
    such that descriptor2 may start execution before descriptor1 completes.
    If descriptor2 depends on the result from descriptor1 then a fence is
    required (on descriptor2) to disable this optimization. The async_tx
    api could implicitly identify dependencies via the 'depend_tx'
    parameter, but that would constrain cases where the dependency chain
    only specifies a completion order rather than a data dependency. So,
    provide an ASYNC_TX_FENCE to explicitly identify data dependencies.

    Signed-off-by: Dan Williams

    Dan Williams
     

30 Aug, 2009

4 commits

  • async_raid6_2data_recov() recovers two data disk failures

    async_raid6_datap_recov() recovers a data disk and the P disk

    These routines are a port of the synchronous versions found in
    drivers/md/raid6recov.c. The primary difference is breaking out the xor
    operations into separate calls to async_xor. Two helper routines are
    introduced to perform scalar multiplication where needed.
    async_sum_product() multiplies two sources by scalar coefficients and
    then sums (xor) the result. async_mult() simply multiplies a single
    source by a scalar.
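
    The arithmetic behind the two helpers can be sketched synchronously in
    userspace. This is only an illustration under stated assumptions: GF(2^8)
    with polynomial 0x11d, and the names gf_mul(), sum_product() and mult()
    are hypothetical stand-ins that do not reproduce the signatures of
    async_sum_product() or async_mult().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise GF(2^8) multiply, reduction polynomial 0x11d (assumed). */
uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* dest[i] = coef_a * a[i] + coef_b * b[i], where '+' is xor. */
void sum_product(uint8_t *dest, const uint8_t *a, uint8_t coef_a,
		 const uint8_t *b, uint8_t coef_b, size_t len)
{
	assert(dest && a && b);
	for (size_t i = 0; i < len; i++)
		dest[i] = gf_mul(coef_a, a[i]) ^ gf_mul(coef_b, b[i]);
}

/* dest[i] = coef * src[i]. */
void mult(uint8_t *dest, const uint8_t *src, uint8_t coef, size_t len)
{
	assert(dest && src);
	for (size_t i = 0; i < len; i++)
		dest[i] = gf_mul(coef, src[i]);
}
```

    With both coefficients set to {01}, sum_product() degenerates to a plain
    xor of the two sources.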

    This implementation also includes, in contrast to the original
    synchronous-only code, special case handling for the 4-disk and 5-disk
    array cases. In these situations the default N-disk algorithm will
    present 0-source or 1-source operations to dma devices. To cover for
    dma devices where the minimum source count is 2 we implement 4-disk and
    5-disk handling in the recovery code.

    [ Impact: asynchronous raid6 recovery routines for 2data and datap cases ]

    Cc: Yuri Tikhonov
    Cc: Ilya Yanok
    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • [ Based on an original patch by Yuri Tikhonov ]

    This adds support for doing asynchronous GF multiplication by adding
    two additional functions to the async_tx API:

    async_gen_syndrome() does simultaneous XOR and Galois field
    multiplication of sources.

    async_syndrome_val() validates the given source buffers against known P
    and Q values.

    When a request is made to run async_pq against more than the hardware
    maximum number of supported sources we need to reuse the previously
    generated P and Q values as sources into the next operation. Care must
    be taken to remove Q from P' and P from Q'. For example to perform a 5
    source pq op with hardware that only supports 4 sources at a time the
    following approach is taken:

    p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
    p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

    p' = p + q + q + src4 = p + src4
    q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4

    Note: 4 is the minimum acceptable maxpq otherwise we punt to
    synchronous-software path.
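
    The continuation identity above can be checked with a small userspace
    sketch. It works on single bytes rather than pages, assumes GF(2^8) with
    polynomial 0x11d, and uses hypothetical names (pq_op(), pq5_continued(),
    pq5_direct()); it is not how a driver implements the operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise GF(2^8) multiply, reduction polynomial 0x11d (assumed). */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* One PQ pass: p = xor of the sources, q = sum of coefs[i] * srcs[i]. */
void pq_op(const uint8_t *srcs, const uint8_t *coefs, size_t n,
	   uint8_t *p, uint8_t *q)
{
	uint8_t pv = 0, qv = 0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		pv ^= srcs[i];
		qv ^= gf_mul(coefs[i], srcs[i]);
	}
	*p = pv;
	*q = qv;
}

/* Five sources on an engine limited to four sources per operation. */
void pq5_continued(const uint8_t src[5], uint8_t *p, uint8_t *q)
{
	static const uint8_t coef_first[4] = {0x01, 0x02, 0x04, 0x08};
	static const uint8_t coef_cont[4] = {0x00, 0x01, 0x00, 0x10};
	uint8_t stage[4];

	pq_op(src, coef_first, 4, p, q);

	/* Reuse p, q, q as sources: the second q cancels q out of p'. */
	stage[0] = *p;
	stage[1] = *q;
	stage[2] = *q;
	stage[3] = src[4];
	pq_op(stage, coef_cont, 4, p, q);
}

/* Reference: all five sources in one pass. */
void pq5_direct(const uint8_t src[5], uint8_t *p, uint8_t *q)
{
	static const uint8_t coef[5] = {0x01, 0x02, 0x04, 0x08, 0x10};

	pq_op(src, coef, 5, p, q);
}
```

    Both routines should return identical P and Q for any input, since the
    zero coefficients keep p out of q' and the duplicated q cancels in p'.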

    The DMA_PREP_CONTINUE flag indicates to the driver to reuse p and q as
    sources (in the above manner) and fill the remaining slots up to maxpq
    with the new sources/coefficients.

    Note1: Some devices have native support for P+Q continuation and can skip
    this extra work. Devices with this capability can advertise it with
    dma_set_maxpq. It is up to each driver how to handle the
    DMA_PREP_CONTINUE flag.

    Note2: The api supports disabling the generation of P when generating Q,
    this is ignored by the synchronous path but is implemented by some dma
    devices to save unnecessary writes. In this case the continuation
    algorithm is simplified to only reuse Q as a source.

    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • We currently walk the parent chain when waiting for a given tx to
    complete; however, this walk may race with the driver cleanup routine.
    The routines in async_raid6_recov.c may fall back to the synchronous
    path at any point so we need to be prepared to call async_tx_quiesce()
    (which calls dma_wait_for_async_tx). To remove the ->parent walk we
    guarantee that every time a dependency is attached ->issue_pending() is
    invoked, then we can simply poll the initial descriptor until
    completion.

    This also allows for a lighter weight 'issue pending' implementation as
    there is no longer a requirement to iterate through all the channels'
    ->issue_pending() routines as long as operations have been submitted in
    an ordered chain. async_tx_issue_pending() is added for this case.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the flat zero_sum_result with a collection of flags to contain
    the P (xor) zero-sum result, and the soon-to-be-utilized Q (raid6
    Reed-Solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead
    of DMA_ since these flags will be used on non-dma-zero-sum enabled
    platforms.
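
    A rough userspace sketch of such a flag collection follows. The enum and
    flag names are modelled on the SUM_CHECK_ namespace mentioned above, but
    their exact spelling and values here are assumptions, as is the
    convention that a set bit reports a mismatch. validate_pq() is a
    hypothetical helper that works on one byte per source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit positions and flags; names and values are assumptions. */
enum sum_check_bits {
	SUM_CHECK_P = 0,
	SUM_CHECK_Q = 1,
};

enum sum_check_flags {
	SUM_CHECK_P_RESULT = (1 << SUM_CHECK_P),
	SUM_CHECK_Q_RESULT = (1 << SUM_CHECK_Q),
};

/* Multiply by {02} in GF(2^8), reduction polynomial 0x11d (assumed). */
static uint8_t gf_mul2(uint8_t a)
{
	return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
}

/*
 * Recompute P and Q from n data bytes and compare against the stored
 * values. Returns 0 when both match, otherwise the failing flags.
 */
unsigned int validate_pq(const uint8_t *data, size_t n, uint8_t p, uint8_t q)
{
	uint8_t pv = 0, qv = 0;
	unsigned int result = 0;

	assert(data != NULL);
	/* Horner evaluation: q = data[0] + {02}*data[1] + {02}^2*data[2]... */
	for (size_t i = n; i-- > 0; ) {
		pv ^= data[i];
		qv = gf_mul2(qv) ^ data[i];
	}
	if (pv != p)
		result |= SUM_CHECK_P_RESULT;
	if (qv != q)
		result |= SUM_CHECK_Q_RESULT;
	return result;
}
```

    Keeping P and Q in separate bits lets a caller tell which parity failed
    instead of receiving a single flat pass/fail value.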

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     

04 Jun, 2009

2 commits

  • Prepare the api for the arrival of a new parameter, 'scribble'. This
    will allow callers to identify scratchpad memory for dma address or page
    address conversions. As this adds yet another parameter, take this
    opportunity to convert the common submission parameters (flags,
    dependency, callback, and callback argument) into an object that is
    passed by reference.

    Also, take this opportunity to fix up the kerneldoc and add notes about
    the relevant ASYNC_TX_* flags for each routine.
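
    The shape of this change can be sketched in userspace as follows. All
    names here (struct submit_ctl, init_submit(), xor_op(),
    count_completion()) are hypothetical stand-ins chosen for illustration;
    the sketch only shows the common parameters being grouped into one
    object passed by reference, and it always takes a synchronous path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*tx_callback)(void *param);

struct tx_desc;			/* opaque stand-in for a descriptor */

/* Common submission parameters gathered into one object. */
struct submit_ctl {
	unsigned int flags;
	struct tx_desc *depend_tx;
	tx_callback cb_fn;
	void *cb_param;
	void *scribble;		/* scratchpad for address conversions */
};

void init_submit(struct submit_ctl *s, unsigned int flags,
		 struct tx_desc *depend_tx, tx_callback cb_fn,
		 void *cb_param, void *scribble)
{
	assert(s != NULL);
	s->flags = flags;
	s->depend_tx = depend_tx;
	s->cb_fn = cb_fn;
	s->cb_param = cb_param;
	s->scribble = scribble;
}

/* Example callback: counts completions through cb_param. */
void count_completion(void *param)
{
	++*(int *)param;
}

/*
 * Operation taking its submission details by reference. There is no
 * engine in this sketch, so it runs synchronously and returns NULL.
 */
struct tx_desc *xor_op(uint8_t *dest, uint8_t *const *srcs, size_t src_cnt,
		       size_t len, const struct submit_ctl *submit)
{
	for (size_t b = 0; b < len; b++) {
		uint8_t v = 0;

		for (size_t i = 0; i < src_cnt; i++)
			v ^= srcs[i][b];
		dest[b] = v;
	}
	if (submit->cb_fn)
		submit->cb_fn(submit->cb_param);
	return NULL;
}
```

    Adding a further common parameter later then only touches the struct and
    its initializer, not every operation's argument list.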

    [ Impact: moves api pass-by-value parameters to a pass-by-reference struct ]

    Signed-off-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In support of inter-channel chaining async_tx utilizes an ack flag to
    gate whether a dependent operation can be chained to another. While the
    flag is not set the chain can be considered open for appending. Setting
    the ack flag closes the chain and flags the descriptor for garbage
    collection. The ASYNC_TX_DEP_ACK flag essentially means "close the
    chain after adding this dependency". Since each operation can only have
    one child the api now implicitly sets the ack flag at dependency
    submission time. This removes an unnecessary management burden from
    clients of the api.

    [ Impact: clean up and enforce one dependency per operation ]

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     

09 Apr, 2009

1 commit

  • 'zero_sum' does not properly describe the operation of generating parity
    and checking that it validates against an existing buffer. Change the
    name of the operation to 'val' (for 'validate'). This is in
    anticipation of the p+q case where it is a requirement to identify the
    target parity buffers separately from the source buffers, because the
    target parity buffers will not have corresponding pq coefficients.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     

26 Mar, 2009

1 commit


07 Jan, 2009

1 commit

  • async_tx and net_dma each have open-coded versions of issue_pending_all,
    so provide a common routine in dmaengine.

    The implementation needs to walk the global device list, so implement
    rcu to allow dma_issue_pending_all to run lockless. Clients protect
    themselves from channel removal events by holding a dmaengine reference.

    Reviewed-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

06 Jan, 2009

1 commit

  • async_tx.ko is a consumer of dma channels. A circular dependency arises
    if modules in drivers/dma rely on common code in async_tx.ko. It
    prevents either module from being unloaded.

    Move dma_wait_for_async_tx and async_tx_run_dependencies to dmaengine.o
    where they should have been from the beginning.

    Reviewed-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

18 Jul, 2008

2 commits


07 Feb, 2008

2 commits


20 Jul, 2007

1 commit

  • Andrew Morton:
    [async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and
    ASYNC_TX_KMAP_SRC can ever be set. We'll end up using the same kmap
    slot for both src and dest and we get either corrupted data or a BUG.

    Evgeniy Polyakov:
    Btw, shouldn't it always be kmap_atomic() even if the flag is not set?
    Those pages are the usual ones returned by alloc_page().

    So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and
    ASYNC_TX_KMAP_SRC flags.

    Cc: Andrew Morton
    Cc: Evgeniy Polyakov
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

13 Jul, 2007

1 commit

  • The async_tx api provides methods for describing a chain of asynchronous
    bulk memory transfers/transforms with support for inter-transactional
    dependencies. It is implemented as a dmaengine client that smooths over
    the details of different hardware offload engine implementations. Code
    that is written to the api can optimize for asynchronous operation and the
    api will fit the chain of operations to the available offload resources.

    I imagine that any piece of ADMA hardware would register with the
    'async_*' subsystem, and a call to async_X would be routed as
    appropriate, or be run in-line. - Neil Brown

    async_tx exploits the capabilities of struct dma_async_tx_descriptor to
    provide an api of the following general format:

    struct dma_async_tx_descriptor *
    async_<operation>(..., struct dma_async_tx_descriptor *depend_tx,
                      dma_async_tx_callback cb_fn, void *cb_param)
    {
            struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>);
            struct dma_device *device = chan ? chan->device : NULL;
            int int_en = cb_fn ? 1 : 0;
            struct dma_async_tx_descriptor *tx = device ?
                    device->device_prep_dma_<operation>(chan, len, int_en) : NULL;

            if (tx) { /* run <operation> asynchronously */
                    ...
                    tx->tx_set_dest(addr, tx, index);
                    ...
                    tx->tx_set_src(addr, tx, index);
                    ...
                    async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
            } else { /* run <operation> synchronously */
                    ...
                    <perform the operation with the cpu>
                    ...
                    async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
            }

            return tx;
    }

    async_tx_find_channel() returns a capable channel from its pool. The
    channel pool is organized as a per-cpu array of channel pointers. The
    async_tx_rebalance() routine is tasked with managing these arrays. In the
    uniprocessor case async_tx_rebalance() tries to spread responsibility
    evenly over channels of similar capabilities. For example if there are two
    copy+xor channels, one will handle copy operations and the other will
    handle xor. In the SMP case async_tx_rebalance() attempts to spread the
    operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
    channel0 while cpu1 gets copy channel 1 and xor channel 1. When a
    dependency is specified async_tx_find_channel defaults to keeping the
    operation on the same channel. A xor->copy->xor chain will stay on one
    channel if it supports both operation types, otherwise the transaction will
    transition between a copy and a xor resource.

    Currently the raid5 implementation in the MD raid456 driver has been
    converted to the async_tx api. A driver for the offload engines on the
    Intel Xscale series of I/O processors, iop-adma, is provided in a later
    commit. With the iop-adma driver and async_tx, raid456 is able to offload
    copy, xor, and xor-zero-sum operations to hardware engines.

    On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
    improvement) and sequential reads to a degraded array (40 - 55%
    improvement). For the other cases performance was roughly equal, +/- a few
    percentage points. On a x86-smp platform the performance of the async_tx
    implementation (in synchronous mode) was also +/- a few percentage points
    of the original implementation. According to 'top' on iop342 CPU
    utilization drops from ~50% to ~15% during a 'resync' while the speed
    according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.

    The tiobench command line used for testing was: tiobench --size 2048
    --block 4096 --block 131072 --dir /mnt/raid --numruns 5
    * iop342 had 1GB of memory available

    Details:
    * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
    async_tx_find_channel a static inline routine that always returns NULL
    * when a callback is specified for a given transaction an interrupt will
    fire at operation completion time and the callback will occur in a
    tasklet. If the channel does not support interrupts then a live
    polling wait will be performed
    * the api is written as a dmaengine client that requests all available
    channels
    * In support of dependencies the api implicitly schedules channel-switch
    interrupts. The interrupt triggers the cleanup tasklet which causes
    pending operations to be scheduled on the next channel
    * Xor engines treat an xor destination address differently than a software
    xor routine. To the software routine the destination address is an implied
    source, whereas engines treat it as a write-only destination. This patch
    modifies the xor_blocks routine to take an explicit destination address
    to mirror the hardware.
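
    The last point, the difference in destination semantics, can be shown
    with a short userspace sketch. The function names xor_implied_dest() and
    xor_explicit_dest() are hypothetical and do not reproduce the xor_blocks
    signature; they only contrast the two conventions described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software convention: dest is an implied source (dest ^= all srcs). */
void xor_implied_dest(uint8_t *dest, uint8_t *const *srcs, size_t src_cnt,
		      size_t len)
{
	assert(dest && srcs);
	for (size_t b = 0; b < len; b++) {
		uint8_t v = dest[b];

		for (size_t i = 0; i < src_cnt; i++)
			v ^= srcs[i][b];
		dest[b] = v;
	}
}

/* Engine convention: dest is write-only (dest = xor of all srcs). */
void xor_explicit_dest(uint8_t *dest, uint8_t *const *srcs, size_t src_cnt,
		       size_t len)
{
	assert(dest && srcs);
	for (size_t b = 0; b < len; b++) {
		uint8_t v = 0;

		/* Accumulate first so dest may also appear among srcs. */
		for (size_t i = 0; i < src_cnt; i++)
			v ^= srcs[i][b];
		dest[b] = v;
	}
}
```

    To get the software behaviour from the engine-style routine, the caller
    lists the destination buffer among the sources as well.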

    Changelog:
    * fixed a leftover debug print
    * don't allow callbacks in async_interrupt_cond
    * fixed xor_block changes
    * fixed usage of ASYNC_TX_XOR_DROP_DEST
    * drop dma mapping methods, suggested by Chris Leech
    * printk warning fixups from Andrew Morton
    * don't use inline in C files, Adrian Bunk
    * select the API when MD is enabled
    * BUG_ON xor source counts
    Signed-off-by: Dan Williams
    Acked-By: NeilBrown

    Dan Williams