01 Mar, 2019

1 commit

  • Clang warns: vector initializers are not compatible with NEON intrinsics
    in big endian mode [-Wnonportable-vector-initialization]

    While the warning is usually justified, it is not a problem here,
    since every lane of the uint8x16_t (16 x uint8_t) is initialized
    with the same value.

    Instead, use vdupq_n_u8 which both compilers lower into a single movi
    instruction: https://godbolt.org/z/vBrgzt

    This also avoids static storage for the constant value.

    Link: https://github.com/ClangBuiltLinux/linux/issues/214
    Suggested-by: Nathan Chancellor
    Reviewed-by: Ard Biesheuvel
    Signed-off-by: Nick Desaulniers
    Signed-off-by: Catalin Marinas

    ndesaulniers@google.com
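
The broadcast described in this commit can be sketched as follows. This is an illustration only, not the kernel code: the helper name splat_u8x16 and the value 0x1d used in the usage note are assumptions, and the memset branch is a stand-in so the sketch also builds on targets without NEON.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: write sixteen copies of `value` to out[0..15]. */
static void splat_u8x16(uint8_t out[16], uint8_t value)
{
    assert(out != NULL);
#if defined(__ARM_NEON)
    /* Broadcast form: there is no per-lane initializer list, so lane
     * ordering on big-endian targets never comes into play. */
    uint8x16_t v = vdupq_n_u8(value);
    vst1q_u8(out, v);
#else
    /* Portable stand-in for builds without NEON support. */
    memset(out, value, 16);
#endif
}
```

Calling splat_u8x16(buf, 0x1d) fills all sixteen bytes of buf with 0x1d on either path.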
     

10 Aug, 2017

1 commit

  • The P/Q left side optimization in the delta syndrome simply involves
    repeatedly multiplying a value by polynomial 'x' in GF(2^8). Given
    that 'x * x * x * x' equals 'x^4' even in the polynomial world, we
    can accelerate this substantially by performing up to 4 such operations
    at once, using the NEON instructions for polynomial multiplication.

    Results on a Cortex-A57 running in 64-bit mode:

    Before:
    -------
    raid6: neonx1 xor() 1680 MB/s
    raid6: neonx2 xor() 2286 MB/s
    raid6: neonx4 xor() 3162 MB/s
    raid6: neonx8 xor() 3389 MB/s

    After:
    ------
    raid6: neonx1 xor() 2281 MB/s
    raid6: neonx2 xor() 3362 MB/s
    raid6: neonx4 xor() 3787 MB/s
    raid6: neonx8 xor() 4239 MB/s

    While we're at it, simplify MASK() by using a signed shift rather than
    a vector compare involving a temp register.

    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Catalin Marinas

    Ard Biesheuvel
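
The idea in this commit can be sketched in scalar C. This is a sketch under assumptions, not the NEON implementation: it assumes the reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), and the names gf_mask, gf_mul_x and gf_mul_xn are hypothetical. It shows that multiplying by x^n in one step (shift, then fold the overflowed bits back in with a carry-less multiply by 0x1d) matches n single multiplications by x, and that the mask can come from a signed shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: low byte of the reduction polynomial 0x11d. */
#define GF_POLY_LOW 0x1du

/* 0xff if the top bit of v is set, else 0x00, using a signed shift.
 * Right-shifting a negative value is implementation-defined in ISO C;
 * mainstream compilers perform an arithmetic shift. */
static uint8_t gf_mask(uint8_t v)
{
    return (uint8_t)((int8_t)v >> 7);
}

/* Multiply by x once: shift left, reduce if a bit overflowed. */
static uint8_t gf_mul_x(uint8_t v)
{
    return (uint8_t)((uint8_t)(v << 1) ^ (gf_mask(v) & GF_POLY_LOW));
}

/* Multiply by x^n in one step, for 1 <= n <= 4. The n bits shifted out
 * are multiplied (carry-less) by 0x1d. That product has degree at most
 * 3 + 4 = 7, so it needs no further reduction; this is why the trick
 * stops at four steps. */
static uint8_t gf_mul_xn(uint8_t v, unsigned n)
{
    assert(n >= 1 && n <= 4);
    uint8_t hi = (uint8_t)(v >> (8 - n)); /* bits that overflow */
    uint8_t lo = (uint8_t)(v << n);       /* bits that stay in range */
    uint8_t fold = 0;
    for (unsigned bit = 0; bit < n; bit++)
        if (hi & (1u << bit))
            fold ^= (uint8_t)(GF_POLY_LOW << bit);
    return (uint8_t)(lo ^ fold);
}
```

In the vectorized version, the bit loop would correspond to a single polynomial multiply instruction per vector.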
     

01 Sep, 2015

1 commit

  • This implements XOR syndrome calculation using NEON intrinsics.
    As before, the module can be built for ARM and arm64 from the
    same source.

    Relative performance on a Cortex-A57 based system:

    raid6: int64x1 gen() 905 MB/s
    raid6: int64x1 xor() 881 MB/s
    raid6: int64x2 gen() 1343 MB/s
    raid6: int64x2 xor() 1286 MB/s
    raid6: int64x4 gen() 1896 MB/s
    raid6: int64x4 xor() 1321 MB/s
    raid6: int64x8 gen() 1773 MB/s
    raid6: int64x8 xor() 1165 MB/s
    raid6: neonx1 gen() 1834 MB/s
    raid6: neonx1 xor() 1278 MB/s
    raid6: neonx2 gen() 2528 MB/s
    raid6: neonx2 xor() 1942 MB/s
    raid6: neonx4 gen() 2888 MB/s
    raid6: neonx4 xor() 2334 MB/s
    raid6: neonx8 gen() 2957 MB/s
    raid6: neonx8 xor() 2232 MB/s
    raid6: using algorithm neonx8 gen() 2957 MB/s
    raid6: .... xor() 2232 MB/s, rmw enabled

    Cc: Markus Stockhausen
    Cc: Neil Brown
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: NeilBrown

    Ard Biesheuvel
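
A scalar sketch of what an XOR (read-modify-write) syndrome update computes, for orientation only. It assumes the 0x11d reduction polynomial and that Q weights data disk i by x^i; the name pq_apply_delta and its signature are hypothetical and do not reflect the kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by x in GF(2^8), assuming reduction polynomial 0x11d. */
static uint8_t gf_mul_x(uint8_t v)
{
    return (uint8_t)((uint8_t)(v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* Hypothetical helper: fold delta = old ^ new for data disk `disk`
 * into existing P and Q blocks of `len` bytes, without touching the
 * other data disks. */
static void pq_apply_delta(uint8_t *p, uint8_t *q, const uint8_t *delta,
                           unsigned disk, size_t len)
{
    assert(p != NULL && q != NULL && delta != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t d = delta[i];
        p[i] ^= d;                      /* P is a plain XOR */
        for (unsigned k = 0; k < disk; k++)
            d = gf_mul_x(d);            /* scale delta by x^disk */
        q[i] ^= d;
    }
}
```

Because both P and Q are linear over GF(2^8), applying the delta gives the same result as regenerating the syndromes from the full updated stripe.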
     

09 Jul, 2013

1 commit

  • Rebased/reworked a patch contributed by Rob Herring that uses
    NEON intrinsics to perform the RAID-6 syndrome calculations.
    It uses the existing unroll.awk code to generate several
    unrolled versions of which the best performing one is selected
    at boot time.

    Signed-off-by: Ard Biesheuvel
    Acked-by: Nicolas Pitre
    Cc: hpa@linux.intel.com

    Ard Biesheuvel
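
For orientation, a scalar and non-unrolled sketch of a RAID-6 syndrome calculation. It assumes the 0x11d reduction polynomial and Horner-style evaluation from the highest data disk down; the name pq_gen and its signature are hypothetical. The NEON versions would process many bytes per step and come in several unroll factors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply by x in GF(2^8), assuming reduction polynomial 0x11d. */
static uint8_t gf_mul_x(uint8_t v)
{
    return (uint8_t)((uint8_t)(v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/* Hypothetical helper: compute P (XOR parity) and Q (sum of x^z * D_z)
 * over `ndisks` data blocks of `len` bytes each. */
static void pq_gen(uint8_t *p, uint8_t *q, const uint8_t *const *data,
                   unsigned ndisks, size_t len)
{
    assert(ndisks >= 1);
    for (size_t i = 0; i < len; i++) {
        uint8_t wp = data[ndisks - 1][i];
        uint8_t wq = wp;
        /* Horner's rule: walk from the second-highest disk down to 0. */
        for (unsigned z = ndisks - 1; z-- > 0; ) {
            wp ^= data[z][i];
            wq = (uint8_t)(gf_mul_x(wq) ^ data[z][i]);
        }
        p[i] = wp;
        q[i] = wq;
    }
}
```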