13 Jul, 2007

5 commits

  • The Intel(R) IOP series of I/O processors integrates an XScale core
    with RAID acceleration engines. The capabilities per platform are:

    iop219:
        (2) copy engines
    iop321:
        (2) copy engines
        (1) xor and block fill engine
    iop33x:
        (2) copy and crc32c engines
        (1) xor, xor zero sum, pq, pq zero sum, and block fill engine
    iop34x (iop13xx):
        (2) copy, crc32c, xor, xor zero sum, and block fill engines
        (1) copy, crc32c, xor, xor zero sum, pq, pq zero sum, and block fill engine

    The driver supports the features of the async_tx api:
    * asynchronous notification of operation completion
    * implicit (interrupt triggered) handling of inter-channel transaction
    dependencies

    The driver adapts to the platform it is running on by two methods:
    1/ #include <asm/arch/adma.h>, which defines the hardware-specific
    iop_chan_* and iop_desc_* routines as a series of static inline
    functions
    2/ The private platform data attached to the platform_device defines
    the capabilities of the channels (see the sketch below)
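
    For illustration, a board file might attach platform data along these
    lines (a minimal sketch; the iop_adma_platform_data field names and
    the hw_id value are assumptions, not quoted from this log):

        #include <linux/dmaengine.h>

        static struct iop_adma_platform_data iop_aau_data = {
            .hw_id = 0,                 /* hypothetical engine id */
            .pool_size = PAGE_SIZE,     /* descriptor pool size */
        };

        static void __init iop_aau_cap_init(void)
        {
            /* advertise this channel as an xor + block fill engine */
            dma_cap_set(DMA_XOR, iop_aau_data.cap_mask);
            dma_cap_set(DMA_MEMSET, iop_aau_data.cap_mask);
        }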

    20070626: Callbacks are run in a tasklet. Given the recent discussion
    on LKML about killing tasklets in favor of workqueues, I did a quick
    conversion of the driver. RAID5 resync performance dropped from 50MB/s
    to 30MB/s, so the tasklet implementation remains until a generic
    softirq interface is available.

    Changelog:
    * fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few
    slots to be requested eventually leading to data corruption
    * enabled the slot allocation routine to attempt to free slots before
    returning -ENOMEM
    * switched the cleanup routine to solely use the software chain and the
    status register to determine if a descriptor is complete. This is
    necessary to support other IOP engines that do not have status writeback
    capability
    * make the driver iop generic
    * modified the allocation routines to understand allocating a group of
    slots for a single operation
    * added a null xor initialization operation for the xor only channel on
    iop3xx
    * support xor operations on buffers larger than the hardware maximum
    * split the do_* routines into separate prep, src/dest set, submit stages
    * added async_tx support (dependent operations initiation at cleanup time)
    * simplified group handling
    * added interrupt support (callbacks via tasklets)
    * brought the pending depth inline with ioat (i.e. 4 descriptors)
    * drop dma mapping methods, suggested by Chris Leech
    * don't use inline in C files, Adrian Bunk
    * remove static tasklet declarations
    * make iop_adma_alloc_slots easier to read and remove chances for a
    corrupted descriptor chain
    * fix locking bug in iop_adma_alloc_chan_resources, Benjamin Herrenschmidt
    * convert capabilities over to dma_cap_mask_t
    * fixup sparse warnings
    * add descriptor flush before iop_chan_enable
    * checkpatch.pl fixes
    * gpl v2 only correction
    * move set_src, set_dest, submit to async_tx methods
    * move group_list and phys to async_tx

    Cc: Russell King
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The async_tx api provides methods for describing a chain of asynchronous
    bulk memory transfers/transforms with support for inter-transactional
    dependencies. It is implemented as a dmaengine client that smooths over
    the details of different hardware offload engine implementations. Code
    that is written to the api can optimize for asynchronous operation and the
    api will fit the chain of operations to the available offload resources.

    I imagine that any piece of ADMA hardware would register with the
    'async_*' subsystem, and a call to async_X would be routed as
    appropriate, or be run in-line. - Neil Brown

    async_tx exploits the capabilities of struct dma_async_tx_descriptor to
    provide an api of the following general format:

    struct dma_async_tx_descriptor *
    async_<operation>(..., struct dma_async_tx_descriptor *depend_tx,
            dma_async_tx_callback cb_fn, void *cb_param)
    {
        struct dma_chan *chan = async_tx_find_channel(depend_tx, <operation>);
        struct dma_device *device = chan ? chan->device : NULL;
        int int_en = cb_fn ? 1 : 0;
        struct dma_async_tx_descriptor *tx = device ?
            device->device_prep_dma_<operation>(chan, len, int_en) : NULL;

        if (tx) { /* run <operation> asynchronously */
            ...
            tx->tx_set_dest(addr, tx, index);
            ...
            tx->tx_set_src(addr, tx, index);
            ...
            async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
        } else { /* run <operation> synchronously */
            ...
            <operation>
            ...
            async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
        }

        return tx;
    }

    async_tx_find_channel() returns a capable channel from its pool. The
    channel pool is organized as a per-cpu array of channel pointers. The
    async_tx_rebalance() routine is tasked with managing these arrays. In
    the uniprocessor case async_tx_rebalance() tries to spread
    responsibility evenly over channels of similar capabilities. For
    example, if there are two copy+xor channels, one will handle copy
    operations and the other will handle xor. In the SMP case
    async_tx_rebalance() attempts to spread the operations evenly over the
    cpus, e.g. cpu0 gets copy channel0 and xor channel0 while cpu1 gets
    copy channel1 and xor channel1. When a dependency is specified,
    async_tx_find_channel() defaults to keeping the operation on the same
    channel. An xor->copy->xor chain will stay on one channel if it
    supports both operation types; otherwise the transaction will
    transition between a copy and an xor resource.
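
    A minimal sketch of such a chain built with the api's xor and memcpy
    entry points (buffer and callback names are illustrative):

        #include <linux/async_tx.h>

        static struct dma_async_tx_descriptor *
        xor_then_copy(struct page *xor_dest, struct page **srcs, int src_cnt,
                      struct page *copy_dest, size_t len,
                      dma_async_tx_callback cb_fn, void *cb_param)
        {
            struct dma_async_tx_descriptor *tx;

            /* xor src_cnt pages into xor_dest */
            tx = async_xor(xor_dest, srcs, 0, src_cnt, len,
                           ASYNC_TX_XOR_ZERO_DST, NULL, NULL, NULL);

            /* the copy depends on the xor; async_tx_find_channel will keep
             * it on the same channel when the engine supports both types */
            tx = async_memcpy(copy_dest, xor_dest, 0, 0, len,
                              ASYNC_TX_DEP_ACK, tx, cb_fn, cb_param);

            return tx;
        }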

    Currently the raid5 implementation in the MD raid456 driver has been
    converted to the async_tx api. A driver for the offload engines on the
    Intel Xscale series of I/O processors, iop-adma, is provided in a later
    commit. With the iop-adma driver and async_tx, raid456 is able to offload
    copy, xor, and xor-zero-sum operations to hardware engines.

    On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
    improvement) and sequential reads to a degraded array (40 - 55%
    improvement). For the other cases performance was roughly equal, +/- a few
    percentage points. On an x86-smp platform the performance of the async_tx
    implementation (in synchronous mode) was also +/- a few percentage points
    of the original implementation. According to 'top' on iop342 CPU
    utilization drops from ~50% to ~15% during a 'resync' while the speed
    according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.

    The tiobench command line used for testing was: tiobench --size 2048
    --block 4096 --block 131072 --dir /mnt/raid --numruns 5
    * iop342 had 1GB of memory available

    Details:
    * if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by
    making async_tx_find_channel a static inline routine that always
    returns NULL (see the stub after this list)
    * when a callback is specified for a given transaction an interrupt
    will fire at operation completion time and the callback will occur in
    a tasklet. if the channel does not support interrupts then a live
    polling wait will be performed
    * the api is written as a dmaengine client that requests all available
    channels
    * In support of dependencies the api implicitly schedules channel-switch
    interrupts. The interrupt triggers the cleanup tasklet which causes
    pending operations to be scheduled on the next channel
    * Xor engines treat an xor destination address differently than a
    software xor routine. To the software routine the destination address
    is an implied source, whereas engines treat it as a write-only
    destination. This patch modifies the xor_blocks routine to take an
    explicit destination address to mirror the hardware (see the sketch
    below).
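
    The CONFIG_DMA_ENGINE=n stub mentioned above amounts to something
    like:

        static inline struct dma_chan *
        async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
                              enum dma_transaction_type tx_type)
        {
            return NULL;
        }

    And a minimal sketch of the explicit-destination convention for
    xor_blocks (buffer names are illustrative):

        void *srcs[2] = { buf_a, buf_b };

        /* dest is named explicitly, though the software routine still
         * treats it as an implied source: dest ^= buf_a ^ buf_b */
        xor_blocks(2, PAGE_SIZE, dest, srcs);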

    Changelog:
    * fixed a leftover debug print
    * don't allow callbacks in async_interrupt_cond
    * fixed xor_block changes
    * fixed usage of ASYNC_TX_XOR_DROP_DEST
    * drop dma mapping methods, suggested by Chris Leech
    * printk warning fixups from Andrew Morton
    * don't use inline in C files, Adrian Bunk
    * select the API when MD is enabled
    * BUG_ON xor source counts
    Signed-off-by: Dan Williams
    Acked-By: NeilBrown

    Dan Williams
     
  • The current implementation assumes that a channel will only be used by one
    client at a time. In order to enable channel sharing the dmaengine core is
    changed to a model where clients subscribe to channel-available-events.
    Instead of tracking how many channels a client wants and how many it
    has received, the core just broadcasts the available channels and lets
    the clients optionally take a reference. The core learns about the
    clients' needs at dma_event_callback time.

    In support of multiple operation types, clients can specify a capability
    mask to only be notified of channels that satisfy a certain set of
    capabilities.
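
    A minimal sketch of a subscriber under this model (the ack/nak policy
    shown is illustrative):

        #include <linux/dmaengine.h>

        static enum dma_state_client
        my_event(struct dma_client *client, struct dma_chan *chan,
                 enum dma_state state)
        {
            switch (state) {
            case DMA_RESOURCE_AVAILABLE:
                return DMA_ACK;     /* take a reference on the channel */
            case DMA_RESOURCE_REMOVED:
                return DMA_ACK;     /* agree to drop our reference */
            default:
                return DMA_DUP;     /* no interest either way */
            }
        }

        static struct dma_client my_client = {
            .event_callback = my_event,
        };

        static int __init my_init(void)
        {
            /* only hear about channels that can do memcpy */
            dma_cap_set(DMA_MEMCPY, my_client.cap_mask);
            dma_async_client_register(&my_client);
            dma_async_client_chan_request(&my_client);
            return 0;
        }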

    Changelog:
    * removed DMA_TX_ARRAY_INIT, no longer needed
    * dma_client_chan_free -> dma_chan_release: switch to global reference
    counting only at device unregistration time, before it was also happening
    at client unregistration time
    * clients now return dma_state_client to dmaengine (ack, dup, nak)
    * checkpatch.pl fixes
    * fixup merge with git-ioat

    Cc: Chris Leech
    Signed-off-by: Shannon Nelson
    Signed-off-by: Dan Williams
    Acked-by: David S. Miller

    Dan Williams
     
  • The current dmaengine interface defines multiple routines per operation,
    i.e. dma_async_memcpy_buf_to_buf, dma_async_memcpy_buf_to_page etc. Adding
    more operation types (xor, crc, etc) to this model would result in an
    unmanageable number of method permutations.

    Are we really going to add a set of hooks for each DMA engine
    whizbang feature?
    - Jeff Garzik

    The descriptor creation process is refactored using the new common
    dma_async_tx_descriptor structure. Instead of per driver
    do_<operation>_<dest>_to_<src> methods, drivers integrate
    dma_async_tx_descriptor into their private software descriptor and then
    define a 'prep' routine per operation. The prep routine allocates a
    descriptor and ensures that the tx_set_src, tx_set_dest, tx_submit routines
    are valid. Descriptor creation and submission becomes:

    struct dma_device *dev;
    struct dma_chan *chan;
    struct dma_async_tx_descriptor *tx;

    tx = dev->device_prep_dma_<operation>(chan, len, int_flag)
    tx->tx_set_src(dma_addr_t, tx, index /* for multi-source ops */)
    tx->tx_set_dest(dma_addr_t, tx, index)
    tx->tx_submit(tx)
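
    A concrete, still schematic, instance of the sequence for a memcpy
    (assuming the caller has already dma-mapped src_dma and dest_dma):

        static dma_cookie_t issue_memcpy(struct dma_device *dev,
                                         struct dma_chan *chan,
                                         dma_addr_t dest_dma,
                                         dma_addr_t src_dma, size_t len)
        {
            struct dma_async_tx_descriptor *tx;

            /* 0 => no completion interrupt requested */
            tx = dev->device_prep_dma_memcpy(chan, len, 0);
            if (!tx)
                return -ENOMEM;

            tx->tx_set_src(src_dma, tx, 0);
            tx->tx_set_dest(dest_dma, tx, 0);
            return tx->tx_submit(tx);
        }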

    In addition to the refactoring, dma_async_tx_descriptor also lays the
    groundwork for defining cross-channel-operation dependencies and a
    callback facility for asynchronous notification of operation
    completion.

    Changelog:
    * drop dma mapping methods, suggested by Chris Leech
    * fix ioat_dma_dependency_added, also caught by Andrew Morton
    * fix dma_sync_wait, change from Andrew Morton
    * uninline large functions, change from Andrew Morton
    * add tx->callback = NULL to dmaengine calls to interoperate with async_tx
    calls
    * hookup ioat_tx_submit
    * convert channel capabilities to a 'cpumask_t like' bitmap
    * removed DMA_TX_ARRAY_INIT, no longer needed
    * checkpatch.pl fixes
    * make set_src, set_dest, and tx_submit descriptor specific methods
    * fixup git-ioat merge
    * move group_list and phys to dma_async_tx_descriptor

    Cc: Jeff Garzik
    Cc: Chris Leech
    Signed-off-by: Shannon Nelson
    Signed-off-by: Dan Williams
    Acked-by: David S. Miller

    Dan Williams
     

29 Jun, 2007

1 commit

  • Rename struct pci_driver data so that false section mismatch warnings won't
    be produced.

    Sam, ISTM that depending on variable names is the weakest & worst part of
    modpost section checking. Should __init_refok work here? I got build
    errors when I tried to use it, probably because the struct pci_driver probe
    and remove methods are not marked "__init_refok".

    WARNING: drivers/dma/ioatdma.o(.data+0x10): Section mismatch: reference to .init.text: (between 'ioat_pci_drv' and 'ioat_pci_tbl')
    WARNING: drivers/dma/ioatdma.o(.data+0x14): Section mismatch: reference to .exit.text: (between 'ioat_pci_drv' and 'ioat_pci_tbl')
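
    The fix is simply a rename so that modpost's name-based whitelist
    (which tolerates variables named *driver) applies; the new name below
    is an assumption, not quoted from the commit:

        -static struct pci_driver ioat_pci_drv = {
        +static struct pci_driver ioat_pci_driver = {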

    Signed-off-by: Randy Dunlap
    Acked-by: Chris Leech
    Cc: Sam Ravnborg
    Cc: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

17 Mar, 2007

1 commit

  • This removes several pointless exports from drivers/dma/dmaengine.c;
    the dma_async_memcpy_*() functions are inlined by <linux/dmaengine.h>
    so those exports are inappropriate.

    It also moves the existing EXPORT_SYMBOL declarations next to their functions,
    so it's now trivial to confirm one-to-one correspondence between exports and
    nonstatic symbols.

    Signed-off-by: David Brownell
    Signed-off-by: Dan Williams
    Acked-by: Chris Leech
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Brownell
     


05 Oct, 2006

1 commit

  • Maintain a per-CPU global "struct pt_regs *" variable which can be used instead
    of passing regs around manually through all ~1800 interrupt handlers in the
    Linux kernel.

    The regs pointer is used in few places, but it potentially costs both
    stack space and code to pass it around. On the FRV arch, removing the
    regs parameter from all the genirq functions results in a 20% speed up
    of the IRQ exit path (ie: from leaving timer_interrupt() to leaving
    do_IRQ()).

    Where appropriate, an arch may override the generic storage facility and do
    something different with the variable. On FRV, for instance, the address is
    maintained in GR28 at all times inside the kernel as part of general exception
    handling.

    Having looked over the code, it appears that the parameter may be handed down
    through up to twenty or so layers of functions. Consider a USB character
    device attached to a USB hub, attached to a USB controller that posts its
    interrupts through a cascaded auxiliary interrupt controller. A character
    device driver may want to pass regs to the sysrq handler through the input
    layer which adds another few layers of parameter passing.

    I've built this code with allyesconfig for x86_64 and i386. I've
    runtested the main part of the code on FRV and i386, though I can't
    test most of the drivers. I've also done partial conversions for
    powerpc and MIPS - these at least compile with minimal configurations.

    This will affect all archs. Mostly the changes should be relatively easy.
    Take do_IRQ(), store the regs pointer at the beginning, saving the old one:

    struct pt_regs *old_regs = set_irq_regs(regs);

    And put the old one back at the end:

    set_irq_regs(old_regs);

    Don't pass regs through to generic_handle_irq() or __do_IRQ().
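
    Put together, an arch's do_IRQ then looks roughly like this (a
    sketch; the exact signature and body vary by arch):

        asmlinkage void do_IRQ(unsigned int irq, struct pt_regs *regs)
        {
            struct pt_regs *old_regs = set_irq_regs(regs);

            irq_enter();
            generic_handle_irq(irq);    /* no regs argument any more */
            irq_exit();

            set_irq_regs(old_regs);
        }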

    In timer_interrupt(), this sort of change will be necessary:

    - update_process_times(user_mode(regs));
    - profile_tick(CPU_PROFILING, regs);
    + update_process_times(user_mode(get_irq_regs()));
    + profile_tick(CPU_PROFILING);

    I'd like to move update_process_times()'s use of get_irq_regs() into itself,
    except that i386, alone of the archs, uses something other than user_mode().

    Some notes on the interrupt handling in the drivers:

    (*) input_regs() is now gone entirely. The regs pointer is no longer
    stored in the input_dev struct.

    (*) finish_unlinks() in drivers/usb/host/ohci-q.c needs checking. It does
    something different depending on whether it's been supplied with a regs
    pointer or not.

    (*) Various IRQ handler function pointers have been moved to type
    irq_handler_t.

    Signed-Off-By: David Howells
    (cherry picked from 1b16e7ac850969f38b375e511e3fa2f474a33867 commit)

    David Howells
     
