19 May, 2008

1 commit

  • Move rcu-protected lists from list.h into a new header file rculist.h.

    This is done because lists are a widely used primitive structure all over the
    kernel and it is currently impossible to include other header files in
    list.h without creating circular dependencies.

    For example, list.h implements rcu-protected lists and uses rcu_dereference()
    without including rcupdate.h. It actually compiles because the users of
    rcu_dereference() are macros. Other RCU functions could be used too, but
    probably aren't because of this.

    Therefore this patch creates rculist.h, which includes rcupdate.h without too
    many changes/troubles.
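
    The point about macros can be illustrated with a small user-space sketch
    (not kernel code). A macro body is only checked where it is expanded, so a
    header can name a helper it never declares. The names demo_dereference and
    demo_next_entry are hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int value;
	struct node *next;
};

/*
 * "Header" part: the macro names demo_dereference(), which has not been
 * declared yet. Nothing is expanded here, so the compiler never sees it.
 */
#define demo_next_entry(pos) demo_dereference((pos)->next)

/*
 * "User" part: the helper only has to be visible at the point where the
 * macro is actually expanded.
 */
static struct node *demo_dereference(struct node *p)
{
	return p; /* a real RCU implementation would add ordering guarantees */
}

static int second_value(struct node *head)
{
	struct node *n;

	assert(head != NULL);
	n = demo_next_entry(head); /* expansion happens here */
	return n ? n->value : -1;
}
```

    Turning demo_next_entry into an inline function above the helper would
    fail to compile, which is the kind of constraint the commit message
    describes.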

    Signed-off-by: Franck Bui-Huu
    Acked-by: Paul E. McKenney
    Acked-by: Josh Triplett
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Franck Bui-Huu
     

18 Apr, 2008

2 commits

  • 'ack' is currently a simple integer that flags whether or not a client is done
    touching fields in the given descriptor. It is effectively just a single bit
    of information. Converting this to a flags parameter allows the other bits to
    be put to use to control completion actions, like dma-unmap, and capture
    results, like xor-zero-sum == 0.

    Changes are one of:
    1/ convert all open-coded ->ack manipulations to use async_tx_ack
    and async_tx_test_ack.
    2/ set the ack bit at prep time where possible
    3/ make drivers store the flags at prep time
    4/ add flags to the device_prep_dma_interrupt prototype
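
    A minimal user-space sketch of the idea follows. The demo_* names and the
    bit values are assumptions made for illustration and are not taken from
    the patch; only the general shape (an ack bit inside a flags word, set and
    tested through helpers) comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits; the kernel's actual names and values may differ. */
enum demo_ctrl_flags {
	DEMO_PREP_INTERRUPT = 1 << 0, /* raise an interrupt on completion */
	DEMO_CTRL_ACK       = 1 << 1, /* client is done with the descriptor */
};

struct demo_descriptor {
	unsigned long flags; /* replaces the former single-purpose 'ack' int */
};

/* Helper instead of open-coded ->ack manipulation (change 1/). */
static void demo_tx_ack(struct demo_descriptor *tx)
{
	tx->flags |= DEMO_CTRL_ACK;
}

static bool demo_tx_test_ack(const struct demo_descriptor *tx)
{
	return (tx->flags & DEMO_CTRL_ACK) != 0;
}

/* Drivers store the caller's flags at prep time (change 3/). */
static void demo_prep(struct demo_descriptor *tx, unsigned long flags)
{
	tx->flags = flags;
}
```

    Because the ack state is now one bit among several, setting it leaves the
    other flags, such as the interrupt request, untouched.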

    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Shrink struct dma_async_tx_descriptor and introduce
    async_tx_channel_switch to properly inject a channel switch interrupt in
    the descriptor stream. This simplifies the locking model as drivers no
    longer need to handle dma_async_tx_descriptor.lock.

    Acked-by: Shannon Nelson
    Signed-off-by: Dan Williams

    Dan Williams
     

19 Mar, 2008

1 commit


14 Mar, 2008

1 commit


07 Feb, 2008

6 commits

  • The source and destination addresses are included to allow channel
    selection based on address alignment.

    Signed-off-by: Dan Williams
    Reviewed-by: Haavard Skinnemoen

    Dan Williams
     
  • Pass a full set of flags to drivers' per-operation 'prep' routines.
    Currently the only flag passed is DMA_PREP_INTERRUPT. The expectation is
    that arch-specific async_tx_find_channel() implementations can exploit this
    capability to find the best channel for an operation.

    Signed-off-by: Dan Williams
    Acked-by: Shannon Nelson
    Reviewed-by: Haavard Skinnemoen

    Dan Williams
     
  • The tx_set_src and tx_set_dest methods were originally implemented to allow
    an array of addresses to be passed down from async_xor to the dmaengine
    driver while minimizing stack overhead. Removing these methods allows
    drivers to have all transaction parameters available at 'prep' time, saves
    two function pointers in struct dma_async_tx_descriptor, and reduces the
    number of indirect branches.

    A consequence of moving this data to the 'prep' routine is that
    multi-source routines like async_xor need temporary storage to convert an
    array of linear addresses into an array of dma addresses. In order to keep
    the same stack footprint as the previous implementation, the input array is
    reused as storage for the dma addresses. This requires that
    sizeof(dma_addr_t) be less than or equal to sizeof(void *). As a
    consequence CONFIG_DMADEVICES now depends on !CONFIG_HIGHMEM64G. It also
    requires that drivers be able to make descriptor resources available when
    the 'prep' routine is polled.
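
    The storage reuse can be sketched in user space as follows. This is an
    illustration under assumptions, not the patch's code: demo_dma_addr_t,
    demo_map and the union layout are hypothetical, and a union is used here
    so the example stays well-defined C rather than casting the pointer array.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a stand-in type; the real dma_addr_t is arch-dependent. */
typedef uintptr_t demo_dma_addr_t;

/* The in-place trick only works if an address fits in a pointer slot. */
_Static_assert(sizeof(demo_dma_addr_t) <= sizeof(void *),
	       "dma address must fit in a pointer-sized slot");

union demo_slot {
	void *virt;          /* what the caller passes in */
	demo_dma_addr_t dma; /* what the driver's prep routine consumes */
};

/* Hypothetical mapping function: offsets the address to fake a bus address. */
static demo_dma_addr_t demo_map(void *virt)
{
	return (demo_dma_addr_t)(uintptr_t)virt + 0x1000u;
}

/* Convert every slot in place; no second array is needed on the stack. */
static void demo_map_in_place(union demo_slot *slots, int count)
{
	for (int i = 0; i < count; i++)
		slots[i].dma = demo_map(slots[i].virt);
}
```

    The compile-time check plays the role of the size requirement stated
    above: if an address were wider than a pointer, the array could not hold
    the converted values.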

    Signed-off-by: Dan Williams
    Acked-by: Shannon Nelson

    Dan Williams
     
  • Remove the unused ASYNC_TX_ASSUME_COHERENT flag. Async_tx is meant to
    hide the difference between asynchronous hardware and synchronous
    software operations, but this flag requires clients to understand the
    cache coherency consequences of the async path.

    Signed-off-by: Dan Williams
    Reviewed-by: Haavard Skinnemoen

    Dan Williams
     
  • A single list_head variable initialized with LIST_HEAD_INIT can almost
    always be replaced with a LIST_HEAD declaration; this shrinks the code
    and looks better.
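
    For readers unfamiliar with the two macros, here is a user-space sketch
    with local copies of the list_head definitions. The variable names and the
    demo_list_empty helper are made up for the example.

```c
#include <assert.h>

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* Before: declaration and initializer spelled out separately. */
static struct list_head demo_old_style = LIST_HEAD_INIT(demo_old_style);

/* After: one macro declares and initializes the empty list. */
static LIST_HEAD(demo_new_style);

/* An empty list is one whose head points back at itself. */
static int demo_list_empty(const struct list_head *head)
{
	return head->next == head;
}
```

    Both declarations produce the same initialized object; the second form
    simply avoids repeating the variable name.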

    Signed-off-by: Denis Cheng
    Signed-off-by: Dan Williams

    Denis Cheng
     
  • do_async_xor must be compiled away on !HAS_DMA archs.

    Signed-off-by: Dan Williams
    Acked-by: Cornelia Huck

    Dan Williams
     

25 Sep, 2007

1 commit

  • Fix dma_wait_for_async_tx to not loop forever in the case where a
    dependency chain is longer than two entries. This condition will not
    happen with current in-kernel drivers, but fix it for future drivers.
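
    One way to make such a wait terminate for chains of any length is to walk
    up to the oldest pending dependency on every iteration. The sketch below
    is a user-space model under assumptions: struct demo_tx and the demo_*
    functions are hypothetical, and setting 'done' stands in for actually
    waiting on hardware. It is not the code from this fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_tx {
	bool done;              /* set once the operation has completed */
	struct demo_tx *parent; /* operation this one depends on, or NULL */
};

/* Walk up until we reach a transaction whose parent is absent or done. */
static struct demo_tx *demo_find_runnable(struct demo_tx *tx)
{
	while (tx->parent && !tx->parent->done)
		tx = tx->parent;
	return tx;
}

/*
 * Complete the chain from the oldest pending dependency down to 'tx'.
 * Returns the number of operations that had to be waited for.
 */
static int demo_wait_for_tx(struct demo_tx *tx)
{
	int waited = 0;

	while (!tx->done) {
		struct demo_tx *iter = demo_find_runnable(tx);

		iter->done = true; /* stand-in for waiting on the hardware */
		waited++;
	}
	return waited;
}
```

    Each pass retires exactly one pending operation, so a chain of N entries
    needs at most N passes regardless of its depth.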

    Found-by: Saeed Bishara
    Signed-off-by: Dan Williams

    Dan Williams
     

20 Jul, 2007

1 commit

  • Andrew Morton:
    [async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and
    ASYNC_TX_KMAP_SRC can ever be set. We'll end up using the same kmap
    slot for both src and dest and we get either corrupted data or a BUG.

    Evgeniy Polyakov:
    Btw, shouldn't it always be kmap_atomic() even if the flag is not set?
    Those pages are the usual ones returned by alloc_page().

    So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and
    ASYNC_TX_KMAP_SRC flags.

    Cc: Andrew Morton
    Cc: Evgeniy Polyakov
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

13 Jul, 2007

1 commit

  • The async_tx api provides methods for describing a chain of asynchronous
    bulk memory transfers/transforms with support for inter-transactional
    dependencies. It is implemented as a dmaengine client that smooths over
    the details of different hardware offload engine implementations. Code
    that is written to the api can optimize for asynchronous operation and the
    api will fit the chain of operations to the available offload resources.

    I imagine that any piece of ADMA hardware would register with the
    'async_*' subsystem, and a call to async_X would be routed as
    appropriate, or be run in-line. - Neil Brown

    async_tx exploits the capabilities of struct dma_async_tx_descriptor to
    provide an api of the following general format:

    struct dma_async_tx_descriptor *
    async_<operation>(..., struct dma_async_tx_descriptor *depend_tx,
                      dma_async_tx_callback cb_fn, void *cb_param)
    {
        struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>);
        struct dma_device *device = chan ? chan->device : NULL;
        int int_en = cb_fn ? 1 : 0;
        struct dma_async_tx_descriptor *tx = device ?
            device->device_prep_dma_<operation>(chan, len, int_en) : NULL;

        if (tx) { /* run asynchronously */
            ...
            tx->tx_set_dest(addr, tx, index);
            ...
            tx->tx_set_src(addr, tx, index);
            ...
            async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
        } else { /* run synchronously */
            ...
            <operation>
            ...
            async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
        }

        return tx;
    }

    async_tx_find_channel() returns a capable channel from its pool. The
    channel pool is organized as a per-cpu array of channel pointers. The
    async_tx_rebalance() routine is tasked with managing these arrays. In the
    uniprocessor case async_tx_rebalance() tries to spread responsibility
    evenly over channels of similar capabilities. For example if there are two
    copy+xor channels, one will handle copy operations and the other will
    handle xor. In the SMP case async_tx_rebalance() attempts to spread the
    operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
    channel0 while cpu1 gets copy channel 1 and xor channel 1. When a
    dependency is specified async_tx_find_channel defaults to keeping the
    operation on the same channel. A xor->copy->xor chain will stay on one
    channel if it supports both operation types, otherwise the transaction will
    transition between a copy and a xor resource.
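
    The selection policy described above can be modelled in a few lines of
    user-space C. Everything here is a simplified assumption: the demo_* names
    are hypothetical, the table is filled by hand instead of by a rebalance
    routine, and the dependency is passed as a channel rather than as a
    descriptor.

```c
#include <assert.h>
#include <stddef.h>

enum demo_op { DEMO_COPY, DEMO_XOR, DEMO_OP_COUNT };

#define DEMO_NR_CPUS 2

struct demo_chan {
	int id;
	unsigned int caps; /* bitmask of (1u << enum demo_op) */
};

/* Per-cpu, per-operation table; a rebalance step would fill it. */
static struct demo_chan *demo_table[DEMO_NR_CPUS][DEMO_OP_COUNT];

static int demo_can_do(const struct demo_chan *chan, enum demo_op op)
{
	return chan && (chan->caps & (1u << op));
}

/*
 * Prefer the channel of the dependency if it can run 'op';
 * otherwise fall back to this cpu's table entry (may be NULL).
 */
static struct demo_chan *demo_find_channel(struct demo_chan *depend_chan,
					   int cpu, enum demo_op op)
{
	if (demo_can_do(depend_chan, op))
		return depend_chan;
	return demo_table[cpu][op];
}
```

    A NULL result corresponds to the case where no capable channel exists and
    the operation has to run synchronously.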

    Currently the raid5 implementation in the MD raid456 driver has been
    converted to the async_tx api. A driver for the offload engines on the
    Intel Xscale series of I/O processors, iop-adma, is provided in a later
    commit. With the iop-adma driver and async_tx, raid456 is able to offload
    copy, xor, and xor-zero-sum operations to hardware engines.

    On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
    improvement) and sequential reads to a degraded array (40 - 55%
    improvement). For the other cases performance was roughly equal, +/- a few
    percentage points. On a x86-smp platform the performance of the async_tx
    implementation (in synchronous mode) was also +/- a few percentage points
    of the original implementation. According to 'top' on iop342 CPU
    utilization drops from ~50% to ~15% during a 'resync' while the speed
    according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.

    The tiobench command line used for testing was: tiobench --size 2048
    --block 4096 --block 131072 --dir /mnt/raid --numruns 5
    * iop342 had 1GB of memory available

    Details:
    * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
    async_tx_find_channel a static inline routine that always returns NULL
    * when a callback is specified for a given transaction an interrupt will
    fire at operation completion time and the callback will occur in a
    tasklet. If the channel does not support interrupts then a live
    polling wait will be performed
    * the api is written as a dmaengine client that requests all available
    channels
    * In support of dependencies the api implicitly schedules channel-switch
    interrupts. The interrupt triggers the cleanup tasklet which causes
    pending operations to be scheduled on the next channel
    * Xor engines treat an xor destination address differently than a software
    xor routine. To the software routine the destination address is an implied
    source, whereas engines treat it as a write-only destination. This patch
    modifies the xor_blocks routine to take an explicit destination address
    to mirror the hardware.
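
    The first detail, compiling away the asynchronous path, follows a common
    pattern that can be sketched in user space. DEMO_DMA_ENGINE and the demo_*
    functions are hypothetical stand-ins for the Kconfig option and the real
    routines; this is an illustration of the pattern, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_chan { int id; };

/* Hypothetical config switch, standing in for a Kconfig option. */
#ifdef DEMO_DMA_ENGINE
struct demo_chan *demo_find_channel(void); /* real lookup lives elsewhere */
#else
/* Always NULL: the compiler can prove the async branch is dead code. */
static inline struct demo_chan *demo_find_channel(void)
{
	return NULL;
}
#endif

/* Returns 1 if the work was offloaded, 0 if it ran synchronously. */
static int demo_copy(char *dst, const char *src, size_t len)
{
	struct demo_chan *chan = demo_find_channel();

	if (chan)
		return 1; /* would hand the copy to the engine */

	for (size_t i = 0; i < len; i++) /* synchronous fallback */
		dst[i] = src[i];
	return 0;
}
```

    With the switch left undefined, callers keep a single code path in the
    source while the optimizer is free to drop the offload branch.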

    Changelog:
    * fixed a leftover debug print
    * don't allow callbacks in async_interrupt_cond
    * fixed xor_block changes
    * fixed usage of ASYNC_TX_XOR_DROP_DEST
    * drop dma mapping methods, suggested by Chris Leech
    * printk warning fixups from Andrew Morton
    * don't use inline in C files, Adrian Bunk
    * select the API when MD is enabled
    * BUG_ON xor source counts

    Signed-off-by: Dan Williams
    Acked-By: NeilBrown

    Dan Williams