Eric Lee / smarc-fsl-linux-kernel

22 Apr, 2015

1 commit

584acdd49 md/raid5: activate raid6 rmw feature ... Browse Code »

Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize rmw_level=1 rmw_level=0 rmw_level=1 rmw_level=0
skip_copy=1 skip_copy=1 skip_copy=0 skip_copy=0
4K 115 KB/s 141 KB/s 165 KB/s 140 KB/s
8K 225 KB/s 275 KB/s 324 KB/s 274 KB/s
16K 434 KB/s 536 KB/s 640 KB/s 534 KB/s
32K 751 KB/s 1,051 KB/s 1,234 KB/s 1,045 KB/s
64K 1,339 KB/s 1,958 KB/s 2,282 KB/s 1,962 KB/s
128K 2,673 KB/s 3,862 KB/s 4,113 KB/s 3,898 KB/s
256K 7,685 KB/s 7,539 KB/s 7,557 KB/s 7,638 KB/s
512K 19,556 KB/s 19,558 KB/s 19,652 KB/s 19,688 Kb/s

Signed-off-by: Markus Stockhausen
Signed-off-by: NeilBrown

Markus Stockhausen
2015-04-22 06:00:42 +0800

04 Jul, 2013

1 commit

48a9db462 drivers/dma: remove unused support for MEMSET operations ... Browse Code »

There have never been any real users of MEMSET operations since they
have been introduced in January 2007 by commit 7405f74badf4 ("dmaengine:
refactor dmaengine around dma_async_tx_descriptor"). Therefore remove
support for them for now, it can be always brought back when needed.

[sebastian.hesselbarth@gmail.com: fix drivers/dma/mv_xor]
Signed-off-by: Bartlomiej Zolnierkiewicz
Signed-off-by: Kyungmin Park
Signed-off-by: Sebastian Hesselbarth
Cc: Vinod Koul
Acked-by: Dan Williams
Cc: Tomasz Figa
Cc: Herbert Xu
Cc: Olof Johansson
Cc: Kevin Hilman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Bartlomiej Zolnierkiewicz
2013-07-04 07:07:42 +0800

09 Sep, 2009

1 commit

0403e3827 dmaengine: add fence support ... Browse Code »

Some engines optimize operation by reading ahead in the descriptor chain
such that descriptor2 may start execution before descriptor1 completes.
If descriptor2 depends on the result from descriptor1 then a fence is
required (on descriptor2) to disable this optimization. The async_tx
api could implicitly identify dependencies via the 'depend_tx'
parameter, but that would constrain cases where the dependency chain
only specifies a completion order rather than a data dependency. So,
provide an ASYNC_TX_FENCE to explicitly identify data dependencies.

Signed-off-by: Dan Williams

Dan Williams
2009-09-09 08:42:50 +0800

30 Aug, 2009

4 commits

0a82a6239 async_tx: add support for asynchronous RAID6 recovery operations ... Browse Code »

async_raid6_2data_recov() recovers two data disk failures

async_raid6_datap_recov() recovers a data disk and the P disk

These routines are a port of the synchronous versions found in
drivers/md/raid6recov.c. The primary difference is breaking out the xor
operations into separate calls to async_xor. Two helper routines are
introduced to perform scalar multiplication where needed.
async_sum_product() multiplies two sources by scalar coefficients and
then sums (xor) the result. async_mult() simply multiplies a single
source by a scalar.

This implemention also includes, in contrast to the original
synchronous-only code, special case handling for the 4-disk and 5-disk
array cases. In these situations the default N-disk algorithm will
present 0-source or 1-source operations to dma devices. To cover for
dma devices where the minimum source count is 2 we implement 4-disk and
5-disk handling in the recovery code.

[ Impact: asynchronous raid6 recovery routines for 2data and datap cases ]

Cc: Yuri Tikhonov
Cc: Ilya Yanok
Cc: H. Peter Anvin
Cc: David Woodhouse
Reviewed-by: Andre Noll
Acked-by: Maciej Sosnowski
Signed-off-by: Dan Williams

Dan Williams
2009-08-30 10:09:27 +0800
b2f46fd8e async_tx: add support for asynchronous GF multiplication ... Browse Code »

[ Based on an original patch by Yuri Tikhonov ]

This adds support for doing asynchronous GF multiplication by adding
two additional functions to the async_tx API:

async_gen_syndrome() does simultaneous XOR and Galois field
multiplication of sources.

async_syndrome_val() validates the given source buffers against known P
and Q values.

When a request is made to run async_pq against more than the hardware
maximum number of supported sources we need to reuse the previous
generated P and Q values as sources into the next operation. Care must
be taken to remove Q from P' and P from Q'. For example to perform a 5
source pq op with hardware that only supports 4 sources at a time the
following approach is taken:

p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

p' = p + q + q + src4 = p + src4
q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4

Note: 4 is the minimum acceptable maxpq otherwise we punt to
synchronous-software path.

The DMA_PREP_CONTINUE flag indicates to the driver to reuse p and q as
sources (in the above manner) and fill the remaining slots up to maxpq
with the new sources/coefficients.

Note1: Some devices have native support for P+Q continuation and can skip
this extra work. Devices with this capability can advertise it with
dma_set_maxpq. It is up to each driver how to handle the
DMA_PREP_CONTINUE flag.

Note2: The api supports disabling the generation of P when generating Q,
this is ignored by the synchronous path but is implemented by some dma
devices to save unnecessary writes. In this case the continuation
algorithm is simplified to only reuse Q as a source.

Cc: H. Peter Anvin
Cc: David Woodhouse
Signed-off-by: Yuri Tikhonov
Signed-off-by: Ilya Yanok
Reviewed-by: Andre Noll
Acked-by: Maciej Sosnowski
Signed-off-by: Dan Williams

Dan Williams
2009-08-30 10:09:27 +0800
95475e571 async_tx: remove walk of tx->parent chain in dma_wait_for_async_tx ... Browse Code »

We currently walk the parent chain when waiting for a given tx to
complete however this walk may race with the driver cleanup routine.
The routines in async_raid6_recov.c may fall back to the synchronous
path at any point so we need to be prepared to call async_tx_quiesce()
(which calls dma_wait_for_async_tx). To remove the ->parent walk we
guarantee that every time a dependency is attached ->issue_pending() is
invoked, then we can simply poll the initial descriptor until
completion.

This also allows for a lighter weight 'issue pending' implementation as
there is no longer a requirement to iterate through all the channels'
->issue_pending() routines as long as operations have been submitted in
an ordered chain. async_tx_issue_pending() is added for this case.

Signed-off-by: Dan Williams

Dan Williams
2009-08-30 10:09:27 +0800
ad283ea4a async_tx: add sum check flags ... Browse Code »

Replace the flat zero_sum_result with a collection of flags to contain
the P (xor) zero-sum result, and the soon to be utilized Q (raid6 reed
solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead
of DMA_ since these flags will be used on non-dma-zero-sum enabled
platforms.

Reviewed-by: Andre Noll
Acked-by: Maciej Sosnowski
Signed-off-by: Dan Williams

Dan Williams
2009-08-30 10:09:26 +0800

04 Jun, 2009

2 commits

a08abd8ca async_tx: structify submission arguments, add scribble ... Browse Code »

Prepare the api for the arrival of a new parameter, 'scribble'. This
will allow callers to identify scratchpad memory for dma address or page
address conversions. As this adds yet another parameter, take this
opportunity to convert the common submission parameters (flags,
dependency, callback, and callback argument) into an object that is
passed by reference.

Also, take this opportunity to fix up the kerneldoc and add notes about
the relevant ASYNC_TX_* flags for each routine.

[ Impact: moves api pass-by-value parameters to a pass-by-reference struct ]

Signed-off-by: Andre Noll
Acked-by: Maciej Sosnowski
Signed-off-by: Dan Williams

Dan Williams
2009-06-04 05:07:35 +0800
88ba2aa58 async_tx: kill ASYNC_TX_DEP_ACK flag ... Browse Code »

In support of inter-channel chaining async_tx utilizes an ack flag to
gate whether a dependent operation can be chained to another. While the
flag is not set the chain can be considered open for appending. Setting
the ack flag closes the chain and flags the descriptor for garbage
collection. The ASYNC_TX_DEP_ACK flag essentially means "close the
chain after adding this dependency". Since each operation can only have
one child the api now implicitly sets the ack flag at dependency
submission time. This removes an unnecessary management burden from
clients of the api.

[ Impact: clean up and enforce one dependency per operation ]

Reviewed-by: Andre Noll
Acked-by: Maciej Sosnowski
Signed-off-by: Dan Williams

Dan Williams
2009-06-04 05:07:34 +0800

09 Apr, 2009

1 commit

099f53cb5 async_tx: rename zero_sum to val ... Browse Code »

'zero_sum' does not properly describe the operation of generating parity
and checking that it validates against an existing buffer. Change the
name of the operation to 'val' (for 'validate'). This is in
anticipation of the p+q case where it is a requirement to identify the
target parity buffers separately from the source buffers, because the
target parity buffers will not have corresponding pq coefficients.

Reviewed-by: Andre Noll
Acked-by: Maciej Sosnowski
Signed-off-by: Dan Williams

Dan Williams
2009-04-09 05:28:37 +0800

26 Mar, 2009

1 commit

06164f319 async_tx: provide __async_inline for HAS_DMA=n archs ... Browse Code »

To allow an async_tx routine to be compiled away on HAS_DMA=n arch it
needs to be declared __always_inline otherwise the compiler may emit
code and cause a link error.

Signed-off-by: Dan Williams

Dan Williams
2009-03-26 00:13:25 +0800

07 Jan, 2009

1 commit

2ba05622b dmaengine: provide a common 'issue_pending_all' implementation ... Browse Code »

async_tx and net_dma each have open-coded versions of issue_pending_all,
so provide a common routine in dmaengine.

The implementation needs to walk the global device list, so implement
rcu to allow dma_issue_pending_all to run lockless. Clients protect
themselves from channel removal events by holding a dmaengine reference.

Reviewed-by: Andrew Morton
Signed-off-by: Dan Williams

Dan Williams
2009-01-07 02:38:14 +0800

06 Jan, 2009

1 commit

07f2211e4 dmaengine: remove dependency on async_tx ... Browse Code »

async_tx.ko is a consumer of dma channels. A circular dependency arises
if modules in drivers/dma rely on common code in async_tx.ko. It
prevents either module from being unloaded.

Move dma_wait_for_async_tx and async_tx_run_dependencies to dmaeninge.o
where they should have been from the beginning.

Reviewed-by: Andrew Morton
Signed-off-by: Dan Williams

Dan Williams
2009-01-06 09:10:19 +0800

18 Jul, 2008

2 commits

3dce01713 async_tx: remove depend_tx from async_tx_sync_epilog ... Browse Code »

All callers of async_tx_sync_epilog have called async_tx_quiesce on the
depend_tx, so async_tx_sync_epilog need only call the callback to
complete the operation.

Signed-off-by: Dan Williams

Dan Williams
2008-07-18 08:59:55 +0800
d2c52b798 async_tx: export async_tx_quiesce ... Browse Code »

Replace open coded "wait and acknowledge" instances with async_tx_quiesce.

Signed-off-by: Dan Williams

Dan Williams
2008-07-18 08:59:55 +0800

07 Feb, 2008

2 commits

47437b2c9 async_tx: allow architecture specific async_tx_find_channel implementations ... Browse Code »

The source and destination addresses are included to allow channel
selection based on address alignment.

Signed-off-by: Dan Williams
Reviewed-by: Haavard Skinnemoen

Dan Williams
2008-02-07 01:12:18 +0800
d909b3475 async_tx: kill ASYNC_TX_ASSUME_COHERENT ... Browse Code »

Remove the unused ASYNC_TX_ASSUME_COHERENT flag. Async_tx is
meant to hide the difference between asynchronous hardware and synchronous
software operations, this flag requires clients to understand cache
coherency consequences of the async path.

Signed-off-by: Dan Williams
Reviewed-by: Haavard Skinnemoen

Dan Williams
2008-02-07 01:12:17 +0800

20 Jul, 2007

1 commit

eb0645a8b async_tx: fix kmap_atomic usage in async_memcpy ... Browse Code »

Andrew Morton:
[async_memcpy] is very wrong if both ASYNC_TX_KMAP_DST and
ASYNC_TX_KMAP_SRC can ever be set. We'll end up using the same kmap
slot for both src add dest and we get either corrupted data or a BUG.

Evgeniy Polyakov:
Btw, shouldn't it always be kmap_atomic() even if flag is not set.
That pages are usual one returned by alloc_page().

So fix the usage of kmap_atomic and kill the ASYNC_TX_KMAP_DST and
ASYNC_TX_KMAP_SRC flags.

Cc: Andrew Morton
Cc: Evgeniy Polyakov
Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2007-07-20 23:44:19 +0800

13 Jul, 2007

1 commit

9bc89cd82 async_tx: add the async_tx api ... Browse Code »

The async_tx api provides methods for describing a chain of asynchronous
bulk memory transfers/transforms with support for inter-transactional
dependencies. It is implemented as a dmaengine client that smooths over
the details of different hardware offload engine implementations. Code
that is written to the api can optimize for asynchronous operation and the
api will fit the chain of operations to the available offload resources.

I imagine that any piece of ADMA hardware would register with the
'async_*' subsystem, and a call to async_X would be routed as
appropriate, or be run in-line. - Neil Brown

async_tx exploits the capabilities of struct dma_async_tx_descriptor to
provide an api of the following general format:

struct dma_async_tx_descriptor *
async_(..., struct dma_async_tx_descriptor *depend_tx,
dma_async_tx_callback cb_fn, void *cb_param)
{
struct dma_chan *chan = async_tx_find_channel(depend_tx, );
struct dma_device *device = chan ? chan->device : NULL;
int int_en = cb_fn ? 1 : 0;
struct dma_async_tx_descriptor *tx = device ?
device->device_prep_dma_(chan, len, int_en) : NULL;

if (tx) { /* run asynchronously */
...
tx->tx_set_dest(addr, tx, index);
...
tx->tx_set_src(addr, tx, index);
...
async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
} else { /* run synchronously */
...

...
async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
}

return tx;
}

async_tx_find_channel() returns a capable channel from its pool. The
channel pool is organized as a per-cpu array of channel pointers. The
async_tx_rebalance() routine is tasked with managing these arrays. In the
uniprocessor case async_tx_rebalance() tries to spread responsibility
evenly over channels of similar capabilities. For example if there are two
copy+xor channels, one will handle copy operations and the other will
handle xor. In the SMP case async_tx_rebalance() attempts to spread the
operations evenly over the cpus, e.g. cpu0 gets copy channel0 and xor
channel0 while cpu1 gets copy channel 1 and xor channel 1. When a
dependency is specified async_tx_find_channel defaults to keeping the
operation on the same channel. A xor->copy->xor chain will stay on one
channel if it supports both operation types, otherwise the transaction will
transition between a copy and a xor resource.

Currently the raid5 implementation in the MD raid456 driver has been
converted to the async_tx api. A driver for the offload engines on the
Intel Xscale series of I/O processors, iop-adma, is provided in a later
commit. With the iop-adma driver and async_tx, raid456 is able to offload
copy, xor, and xor-zero-sum operations to hardware engines.

On iop342 tiobench showed higher throughput for sequential writes (20 - 30%
improvement) and sequential reads to a degraded array (40 - 55%
improvement). For the other cases performance was roughly equal, +/- a few
percentage points. On a x86-smp platform the performance of the async_tx
implementation (in synchronous mode) was also +/- a few percentage points
of the original implementation. According to 'top' on iop342 CPU
utilization drops from ~50% to ~15% during a 'resync' while the speed
according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.

The tiobench command line used for testing was: tiobench --size 2048
--block 4096 --block 131072 --dir /mnt/raid --numruns 5
* iop342 had 1GB of memory available

Details:
* if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making
async_tx_find_channel a static inline routine that always returns NULL
* when a callback is specified for a given transaction an interrupt will
fire at operation completion time and the callback will occur in a
tasklet. if the the channel does not support interrupts then a live
polling wait will be performed
* the api is written as a dmaengine client that requests all available
channels
* In support of dependencies the api implicitly schedules channel-switch
interrupts. The interrupt triggers the cleanup tasklet which causes
pending operations to be scheduled on the next channel
* Xor engines treat an xor destination address differently than a software
xor routine. To the software routine the destination address is an implied
source, whereas engines treat it as a write-only destination. This patch
modifies the xor_blocks routine to take a an explicit destination address
to mirror the hardware.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select the API when MD is enabled
* BUG_ON xor source counts
Signed-off-by: Dan Williams
Acked-By: NeilBrown

Dan Williams
2007-07-13 23:06:14 +0800