09 Sep, 2009

24 commits

  • Channel switching is problematic for some dmaengine drivers as the
    architecture precludes separating the ->prep from ->submit. In these
    cases the driver can select ASYNC_TX_DISABLE_CHANNEL_SWITCH to modify
    the async_tx allocator to only return channels that support all of the
    required asynchronous operations.

    For example MD_RAID456=y selects support for asynchronous xor, xor
    validate, pq, pq validate, and memcpy. When
    ASYNC_TX_DISABLE_CHANNEL_SWITCH=y any channel with all these
    capabilities is marked DMA_ASYNC_TX allowing async_tx_find_channel() to
    quickly locate compatible channels with the guarantee that dependency
    chains will remain on one channel. When
    ASYNC_TX_DISABLE_CHANNEL_SWITCH=n async_tx_find_channel() may select
    channels that lead to operation chains that need to cross channel
    boundaries using the async_tx channel switch capability.

    Signed-off-by: Dan Williams

    Dan Williams
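    A minimal sketch, assuming the 2009-era dmaengine capability helpers, of
    how a client could test for the combined capability bit described above;
    the helper name is illustrative, not part of the patch:

      #include <linux/dmaengine.h>

      /* With ASYNC_TX_DISABLE_CHANNEL_SWITCH=y a channel that supports the
       * whole required operation set is tagged DMA_ASYNC_TX, so one
       * capability test suffices. */
      static bool chan_suits_raid456(struct dma_chan *chan)
      {
              return dma_has_cap(DMA_ASYNC_TX, chan->device->cap_mask);
      }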
     
  • Some engines optimize operation by reading ahead in the descriptor chain
    such that descriptor2 may start execution before descriptor1 completes.
    If descriptor2 depends on the result from descriptor1 then a fence is
    required (on descriptor2) to disable this optimization. The async_tx
    api could implicitly identify dependencies via the 'depend_tx'
    parameter, but that would constrain cases where the dependency chain
    only specifies a completion order rather than a data dependency. So,
    provide an ASYNC_TX_FENCE to explicitly identify data dependencies.

    Signed-off-by: Dan Williams

    Dan Williams
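    A hedged sketch of how a caller flags such a data dependency, assuming
    the async_submit_ctl submission interface introduced in this series (the
    wrapper function and its parameters are illustrative):

      #include <linux/async_tx.h>

      /* 'dest' is produced by the xor and then consumed by the copy, so the
       * copy carries ASYNC_TX_FENCE to disable engine read-ahead. */
      static struct dma_async_tx_descriptor *
      fenced_xor_then_copy(struct page *dest, struct page **srcs, int src_cnt,
                           struct page *copy_dest, size_t len,
                           addr_conv_t *scribble)
      {
              struct async_submit_ctl submit;
              struct dma_async_tx_descriptor *tx;

              init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, NULL, NULL,
                                NULL, scribble);
              tx = async_xor(dest, srcs, 0, src_cnt, len, &submit);

              /* real data dependency: fence the copy against the xor */
              init_async_submit(&submit, ASYNC_TX_FENCE, tx, NULL, NULL,
                                scribble);
              return async_memcpy(copy_dest, dest, 0, 0, len, &submit);
      }

    A dependency that only orders completion, by contrast, would pass the
    depend_tx without ASYNC_TX_FENCE.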
     
  • Merge commit; conflicts resolved in:
    include/linux/dmaengine.h

    Dan Williams
     
  • Handle descriptor allocation failures by polling for a descriptor. The
    driver will force forward progress when polled. In the best case this
    polling interval will be the time it takes for one dma memcpy
    transaction to complete. In the worst case, channel hang, we will need
    to wait 100ms for the cleanup watchdog to fire (ioatdma driver).

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Increment the allocation order of the descriptor ring every time we run
    out of descriptors, up to the maximum allocation order specified by the
    module parameter 'ioat_max_alloc_order'. After each idle period
    decrement the allocation order to a minimum order of
    'ioat_ring_alloc_order' (i.e. the default ring size, tunable as a module
    parameter).

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
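    A minimal sketch of the grow/shrink policy described above; the helper is
    illustrative and the actual ring reallocation is left to the driver:

      #include <linux/types.h>

      /* 'order' is the current allocation order; the ring holds
       * 1 << order descriptors. */
      static u16 next_ring_order(u16 order, bool exhausted, bool idle,
                                 u16 min_order, u16 max_order)
      {
              if (exhausted && order < max_order)
                      return order + 1;  /* grow when descriptors run out */
              if (idle && order > min_order)
                      return order - 1;  /* shrink back after an idle period */
              return order;
      }

    Here min_order stands in for ioat_ring_alloc_order and max_order for
    ioat_max_alloc_order.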
     
  • In order to support dynamic resizing of the descriptor ring or polling
    for a descriptor in the presence of a hung channel the reset handler
    needs to make progress while in a non-preemptible context. The current
    workqueue implementation precludes polling channel reset completion
    under spin_lock().

    This conversion also allows us to return to opportunistic cleanup in the
    ioat2 case as the timer implementation guarantees at least one cleanup
    after every descriptor is submitted. This means the worst case
    completion latency becomes the timer frequency (for exceptional
    circumstances), but with the benefit of avoiding busy waiting when the
    lock is contended.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Save 4 bytes per software descriptor by transmitting tx_cnt in an unused
    portion of the hardware descriptor.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Mark all single use initialization routines with __devinit.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The register write in ioat_dma_cleanup_tasklet is unfortunate in two
    ways:
    1/ It clears the extra 'enable' bits that we set at alloc_chan_resources time
    2/ It gives the impression that it disables interrupts when it is in
    fact re-arming interrupts

    [ Impact: fix, persist the value of the chanctrl register when re-arming ]

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Don't trust that the reserved bits are always zero; also sanity check
    the returned value.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The cleanup path makes an effort to only perform an atomic read of the
    64-bit completion address. However in the 32-bit case it does not
    matter if we read the upper-32 and lower-32 non-atomically because the
    upper-32 will always be zero.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Provide some output for debugging the driver.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The unified ioat1/ioat2 ioat_dma_unmap() implementation derives the
    source and dest addresses from the unmap descriptor. There is no longer
    a need to track this information in struct ioat_desc_sw.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the current linked list, munged into a ring, with a native
    ring-buffer implementation. The benefit of this approach is reduced
    overhead: many parameters can be derived from ring position with simple
    pointer comparisons, and descriptor allocation/freeing becomes just a
    manipulation of head/tail pointers.

    It requires a contiguous allocation for the software descriptor
    information.

    Since this arrangement is significantly different from the ioat1 chain,
    move ioat2,3 support into its own file and header. Common routines are
    exported from driver/dma/ioat/dma.[ch].

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
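    A hedged sketch of the ring bookkeeping this enables; field and helper
    names are abridged stand-ins for the ioat2 definitions. head and tail are
    free-running 16-bit counters and the ring size is a power of two:

      #include <linux/types.h>

      struct ring_sketch {
              u16 head;          /* next descriptor slot to allocate */
              u16 tail;          /* next descriptor slot to clean up */
              u16 alloc_order;   /* ring holds 1 << alloc_order entries */
      };

      static inline unsigned int ring_size(const struct ring_sketch *r)
      {
              return 1 << r->alloc_order;
      }

      static inline unsigned int ring_active(const struct ring_sketch *r)
      {
              /* descriptors submitted but not yet cleaned up */
              return (r->head - r->tail) & (ring_size(r) - 1);
      }

      static inline unsigned int ring_space(const struct ring_sketch *r)
      {
              /* keep one slot free so full and empty are distinguishable */
              return ring_size(r) - 1 - ring_active(r);
      }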
     
  • Prepare the code for the conversion of the ioat2 linked-list-ring into a
    native ring buffer. After this conversion ioat2 channels will share
    less of the ioat1 infrastructure, but there will still be places where
    sharing is possible. struct ioat_chan_common is created to house the
    channel attributes that will remain common between ioat1 and ioat2
    channels.

    For every routine that accesses both common and hardware-specific fields,
    the old unified 'ioat_chan' pointer is split into an 'ioat' and a 'chan'
    pointer, where 'chan' references the common fields and 'ioat' the
    hardware/version-specific ones.

    [ Impact: pure structure member movement/variable renames, no logic changes ]

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
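    A hedged sketch of the resulting layout; the fields shown are abridged
    and illustrative rather than the full driver definitions:

      #include <linux/dmaengine.h>

      struct ioat_chan_common {           /* state shared by ioat1 and ioat2,3 */
              struct dma_chan common;
              void __iomem *reg_base;
              unsigned long state;
      };

      struct ioat2_dma_chan {             /* ring-specific state */
              struct ioat_chan_common base;
              u16 head;
              u16 tail;
              u16 alloc_order;
      };

    A routine that touches both would then take a 'struct ioat2_dma_chan
    *ioat' and derive 'struct ioat_chan_common *chan = &ioat->base'.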
     
  • If a callback is to be attached to a descriptor, the channel needs to
    know at ->prep time so it can set the interrupt enable bit. This is in
    preparation for moving ioat2 descriptor preparation from
    ->submit to ->prep.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The async_tx api assumes that after a successful ->prep a subsequent
    ->submit will not fail due to a lack of resources.

    This also fixes a bug in the allocation failure case. Previously the
    descriptors allocated prior to the allocation failure would not be
    returned to the free list.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • This cleans up a mess of and'ing and or'ing bit definitions, and allows
    simple assignments from the specified dma_ctrl_flags parameter.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • ->dmacount tracks the sequence number of active descriptors. It is
    written to the DMACOUNT register to update the channel's view of pending
    descriptors in the chain. The register is 16-bits so ->dmacount should
    be unsigned and 16-bit as well. Also modify ->desccount to maintain
    alignment.

    This was never a problem in practice because we never compared dmacount
    values, but this is a bug waiting to happen.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
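    A small sketch of the point above; names are abridged:

      #include <linux/types.h>

      struct ioat_count_sketch {
              u16 dmacount;    /* sequence number last written to DMACOUNT */
              u16 desccount;   /* kept 16-bit to match and preserve alignment */
      };

      static u16 descs_pending(const struct ioat_count_sketch *c,
                               u16 hw_completed)
      {
              /* 16-bit subtraction wraps exactly like the register does */
              return c->dmacount - hw_completed;
      }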
     
  • Towards the removal of ioatdma_device.version split the initialization
    path into distinct versions. This conversion:
    1/ moves version specific probe code to version specific routines
    2/ removes the need for ioat_device
    3/ turns off the ioat1 msi quirk if the device is reinitialized for intx

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The only .c files that utilize these protected prototypes depend on
    CONFIG_INTEL_IOATDMA=y, so there is no value gained in providing empty
    prototypes.

    [ Impact: pure cleanup ]

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Assorted renames and cleanups:
    * reduce device->common. to dma-> in ioat_dma_{probe,remove,selftest}
    * ioat_lookup_chan_by_index to ioat_chan_by_index
    * multi-line function definitions
    * ioat_desc_sw.async_tx to ioat_desc_sw.txd
    * desc->txd. to tx-> in cleanup routine

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The driver currently duplicates much of what these routines offer, so
    just use the common code. For example ->irq_mode tracks what interrupt
    mode was initialized, which duplicates the ->msix_enabled and
    ->msi_enabled handling in pcim_release.

    This also adds a check to the return value of dma_async_device_register,
    which can fail.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
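    A hedged sketch of a probe path built on the managed (devres) helpers
    referred to above; the BAR mask and resource name are illustrative:

      #include <linux/init.h>
      #include <linux/pci.h>

      static int __devinit sketch_probe(struct pci_dev *pdev,
                                        const struct pci_device_id *id)
      {
              int err;

              err = pcim_enable_device(pdev);
              if (err)
                      return err;

              err = pcim_iomap_regions(pdev, 1 << 0, "ioat-sketch");
              if (err)
                      return err;

              /* no explicit teardown: devres releases the regions and any
               * msi/msix state when the device is unbound */
              return 0;
      }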
     
  • Some of these defines may be useful outside of dma.c and the header is
    private so there are no namespace pollution concerns.

    Signed-off-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     

30 Aug, 2009

16 commits

  • Now that the resources to handle stripe_head operations are allocated
    percpu it is possible for raid5d to distribute stripe handling over
    multiple cores. This conversion also adds a call to cond_resched() in
    the non-multicore case to prevent one core from getting monopolized for
    raid operations.

    Cc: Arjan van de Ven
    Signed-off-by: Dan Williams

    Dan Williams
     
  • These routines have been replaced by their asynchronous counterparts.

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • 1/ Use STRIPE_OP_BIOFILL to offload completion of read requests to
    raid_run_ops
    2/ Implement a handler for sh->reconstruct_state similar to the raid5 case
    (adds handling of Q parity)
    3/ Prevent handle_parity_checks6 from running concurrently with 'compute'
    operations
    4/ Hook up raid_run_ops

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • [ Based on an original patch by Yuri Tikhonov ]

    Implement the state machine for handling the RAID-6 parities check and
    repair functionality. Note that, unlike raid5, the raid6 case does not
    need to check for new failures, as it always writes back the correct
    disks. The raid5 case can be updated to check zero_sum_result to avoid
    getting confused by new failures rather than retrying the entire check
    operation.

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Dan Williams
     
  • In the synchronous implementation of stripe dirtying we processed a
    degraded stripe with one call to handle_stripe_dirtying6(). I.e.
    compute the missing blocks from the other drives, then copy in the new
    data and reconstruct the parities.

    In the asynchronous case we do not perform stripe operations directly.
    Instead, operations are scheduled with flags to be later serviced by
    raid_run_ops. So, for the degraded case the final reconstruction step
    can only be carried out after all blocks have been brought up to date by
    being read or computed. As in the raid5 case, schedule_reconstruction()
    sets STRIPE_OP_RECONSTRUCT to request a parity generation pass and
    through operation chaining can handle compute and reconstruct in a
    single raid_run_ops pass.

    [dan.j.williams@intel.com: fixup handle_stripe_dirtying6 gating]
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • Modify handle_stripe_fill6 to work asynchronously by introducing
    fetch_block6 as the raid6 analog of fetch_block5 (schedule compute
    operations for missing/out-of-sync disks).

    [dan.j.williams@intel.com: compute D+Q in one pass]
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • Extend schedule_reconstruction5 for reuse by the raid6 path. Add
    support for generating Q and BUG() if a request is made to perform
    'prexor'.

    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Yuri Tikhonov
     
  • [ Based on an original patch by Yuri Tikhonov ]

    The raid_run_ops routine uses the asynchronous offload api and
    the stripe_operations member of a stripe_head to carry out xor+pq+copy
    operations asynchronously, outside the lock.

    The operations performed by RAID-6 are the same as in the RAID-5 case,
    except that STRIPE_OP_PREXOR operations are not supported. All the others
    are supported:
    STRIPE_OP_BIOFILL     - copy data into request buffers to satisfy a read request
    STRIPE_OP_COMPUTE_BLK - generate missing blocks (1 or 2) in the cache from the other blocks
    STRIPE_OP_BIODRAIN    - copy data out of request buffers to satisfy a write request
    STRIPE_OP_RECONSTRUCT - recalculate parity for new data that has entered the cache
    STRIPE_OP_CHECK       - verify that the parity is correct

    The flow is the same as in the RAID-5 case, and reuses some routines, namely:
    1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
    2/ ops_complete_compute (updated to set up to 2 targets uptodate)
    3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)

    [neilb@suse.de: fixes to get it to pass mdadm regression suite]
    Reviewed-by: Andre Noll
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Signed-off-by: Dan Williams

    Dan Williams
     
  • ops_complete_compute5 can be reused in the raid6 path if it is updated to
    generically handle a second target.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Port drivers/md/raid6test/test.c to use the async raid6 recovery
    routines. This is meant as a unit test for raid6 acceleration drivers. In
    addition to the 16-drive test case this implements tests for the 4-disk and
    5-disk special cases (dma devices cannot generically handle fewer than 2
    sources), and adds a test for the D+Q case.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Test raid6 p+q operations with a simple "always multiply by 1" q
    calculation to fit into dmatest's current destination verification
    scheme.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • async_raid6_2data_recov() recovers two data disk failures

    async_raid6_datap_recov() recovers a data disk and the P disk

    These routines are a port of the synchronous versions found in
    drivers/md/raid6recov.c. The primary difference is breaking out the xor
    operations into separate calls to async_xor. Two helper routines are
    introduced to perform scalar multiplication where needed.
    async_sum_product() multiplies two sources by scalar coefficients and
    then sums (xor) the result. async_mult() simply multiplies a single
    source by a scalar.

    This implementation also includes, in contrast to the original
    synchronous-only code, special case handling for the 4-disk and 5-disk
    array cases. In these situations the default N-disk algorithm will
    present 0-source or 1-source operations to dma devices. To cover for
    dma devices where the minimum source count is 2 we implement 4-disk and
    5-disk handling in the recovery code.

    [ Impact: asynchronous raid6 recovery routines for 2data and datap cases ]

    Cc: Yuri Tikhonov
    Cc: Ilya Yanok
    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
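    A hedged sketch of invoking the new recovery routines; blocks[] holds the
    stripe's pages in disk order (data disks, then P, then Q), and the
    wrapper itself is illustrative:

      #include <linux/async_tx.h>

      static struct dma_async_tx_descriptor *
      recover_two_data(int disks, size_t bytes, int faila, int failb,
                       struct page **blocks, addr_conv_t *scribble)
      {
              struct async_submit_ctl submit;

              init_async_submit(&submit, 0, NULL, NULL, NULL, scribble);
              return async_raid6_2data_recov(disks, bytes, faila, failb,
                                             blocks, &submit);
      }

    The data+P case is analogous, calling async_raid6_datap_recov() with a
    single failed index.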
     
  • [ Based on an original patch by Yuri Tikhonov ]

    This adds support for doing asynchronous GF multiplication by adding
    two additional functions to the async_tx API:

    async_gen_syndrome() does simultaneous XOR and Galois field
    multiplication of sources.

    async_syndrome_val() validates the given source buffers against known P
    and Q values.

    When a request is made to run async_pq against more than the hardware
    maximum number of supported sources we need to reuse the previous
    generated P and Q values as sources into the next operation. Care must
    be taken to remove Q from P' and P from Q'. For example to perform a 5
    source pq op with hardware that only supports 4 sources at a time the
    following approach is taken:

    p, q = PQ(src0, src1, src2, src3, COEF({01}, {02}, {04}, {08}))
    p', q' = PQ(p, q, q, src4, COEF({00}, {01}, {00}, {10}))

    p' = p + q + q + src4 = p + src4
    q' = {00}*p + {01}*q + {00}*q + {10}*src4 = q + {10}*src4

    Note: 4 is the minimum acceptable maxpq; otherwise we punt to the
    synchronous software path.

    The DMA_PREP_CONTINUE flag indicates to the driver to reuse p and q as
    sources (in the above manner) and fill the remaining slots up to maxpq
    with the new sources/coefficients.

    Note1: Some devices have native support for P+Q continuation and can skip
    this extra work. Devices with this capability can advertise it with
    dma_set_maxpq. It is up to each driver how to handle the
    DMA_PREP_CONTINUE flag.

    Note2: The api supports disabling the generation of P when generating Q;
    this is ignored by the synchronous path but is implemented by some dma
    devices to save unnecessary writes. In this case the continuation
    algorithm is simplified to only reuse Q as a source.

    Cc: H. Peter Anvin
    Cc: David Woodhouse
    Signed-off-by: Yuri Tikhonov
    Signed-off-by: Ilya Yanok
    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
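    A hedged sketch of generating and then validating P/Q with the new api;
    blocks[] holds disks-2 data pages followed by the P and Q pages, and
    'spare' is a scratch page used by the validate path (the wrapper is
    illustrative):

      #include <linux/async_tx.h>

      static struct dma_async_tx_descriptor *
      gen_then_check_pq(struct page **blocks, int disks, size_t len,
                        struct page *spare, enum sum_check_flags *result,
                        addr_conv_t *scribble)
      {
              struct async_submit_ctl submit;
              struct dma_async_tx_descriptor *tx;

              init_async_submit(&submit, 0, NULL, NULL, NULL, scribble);
              tx = async_gen_syndrome(blocks, 0, disks, len, &submit);

              /* the check consumes the P/Q just generated, so fence it */
              init_async_submit(&submit, ASYNC_TX_FENCE, tx, NULL, NULL,
                                scribble);
              return async_syndrome_val(blocks, 0, disks, len, result,
                                        spare, &submit);
      }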
     
  • We currently walk the parent chain when waiting for a given tx to
    complete; however, this walk may race with the driver cleanup routine.
    The routines in async_raid6_recov.c may fall back to the synchronous
    path at any point so we need to be prepared to call async_tx_quiesce()
    (which calls dma_wait_for_async_tx). To remove the ->parent walk we
    guarantee that every time a dependency is attached ->issue_pending() is
    invoked, then we can simply poll the initial descriptor until
    completion.

    This also allows for a lighter weight 'issue pending' implementation as
    there is no longer a requirement to iterate through all the channels'
    ->issue_pending() routines as long as operations have been submitted in
    an ordered chain. async_tx_issue_pending() is added for this case.

    Signed-off-by: Dan Williams

    Dan Williams
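    A minimal sketch of the simplified wait this enables, using the existing
    dma_sync_wait() helper to poll the descriptor's own cookie:

      #include <linux/dmaengine.h>

      static enum dma_status wait_on_tx(struct dma_async_tx_descriptor *tx)
      {
              /* every dependency has already had ->issue_pending() invoked,
               * so polling this descriptor alone is sufficient */
              return dma_sync_wait(tx->chan, tx->cookie);
      }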
     
  • If module_init and module_exit are nops then neither need to be defined.

    [ Impact: pure cleanup ]

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Replace the flat zero_sum_result with a collection of flags to contain
    the P (xor) zero-sum result and the soon-to-be-utilized Q (raid6
    Reed-Solomon syndrome) zero-sum result. Use the SUM_CHECK_ namespace instead
    of DMA_ since these flags will be used on non-dma-zero-sum enabled
    platforms.

    Reviewed-by: Andre Noll
    Acked-by: Maciej Sosnowski
    Signed-off-by: Dan Williams

    Dan Williams
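    A small sketch of consuming the new flags in a completion path; the
    function is illustrative:

      #include <linux/async_tx.h>

      static void report_zero_sum(enum sum_check_flags result)
      {
              if (result & SUM_CHECK_P_RESULT)
                      pr_debug("P (xor) zero-sum check failed\n");
              if (result & SUM_CHECK_Q_RESULT)
                      pr_debug("Q (reed-solomon) zero-sum check failed\n");
      }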