27 Nov, 2020

2 commits

  • Instead of calculating the tag and returning it to the caller on
    decryption, use a SIMD compare and min across vector to perform
    the comparison. This is slightly more efficient, and removes the
    need on the caller's part to wipe the tag from memory if the
    decryption failed.

    While at it, switch to unsigned int when passing cryptlen and
    assoclen - we don't support input sizes where it matters anyway.

    Signed-off-by: Ard Biesheuvel
    Reviewed-by: Ondrej Mosnacek
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     
  • Avoid copying the tail block via a stack buffer if the total size
    exceeds a single AEGIS block. In this case, we can use overlapping
    loads and stores and NEON permutation instructions instead, which
    leads to a modest performance improvement on some cores (< 5%),
    and is slightly cleaner. Note that we still need to use a stack
    buffer if the entire input is smaller than 16 bytes, given that
    we cannot use 16 byte NEON loads and stores safely in this case.

    Signed-off-by: Ard Biesheuvel
    Reviewed-by: Ondrej Mosnacek
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     

25 Oct, 2019

1 commit


30 Aug, 2019

1 commit

  • When building the new aegis128 NEON code in big endian mode, Clang
    complains about the const uint8x16_t permute vectors in the following
    way:

    crypto/aegis128-neon-inner.c:58:40: warning: vector initializers are not
    compatible with NEON intrinsics in big endian mode
    [-Wnonportable-vector-initialization]
    static const uint8x16_t shift_rows = {
    ^
    crypto/aegis128-neon-inner.c:58:40: note: consider using vld1q_u8() to
    initialize a vector from memory, or vcombine_u8(vcreate_u8(), vcreate_u8())
    to initialize from integer constants

    Since the same issue applies to the uint8x16x4_t loads of the AES Sbox,
    update those references as well. However, since GCC does not implement
    the vld1q_u8_x4() intrinsic, switch from IS_ENABLED() to a preprocessor
    conditional to conditionally include this code.

    Reported-by: Nathan Chancellor
    Signed-off-by: Ard Biesheuvel
    Tested-by: Nathan Chancellor
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     

15 Aug, 2019

2 commits

  • Provide a version of the core AES transform to the aegis128 SIMD
    code that does not rely on the special AES instructions, but uses
    plain NEON instructions instead. This allows the SIMD version of
    the aegis128 driver to be used on arm64 systems that do not
    implement those instructions (which are not mandatory in the
    architecture), such as the Raspberry Pi 3.

    Since GCC makes a mess of this when using the tbl/tbx intrinsics
    to perform the sbox substitution, preload the Sbox into v16..v31
    in this case and use inline asm to emit the tbl/tbx instructions.
    Clang does not support this approach, nor does it require it, since
    it does a much better job at code generation, so there we use the
    intrinsics as usual.

    Cc: Nick Desaulniers
    Signed-off-by: Ard Biesheuvel
    Acked-by: Nick Desaulniers
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     
  • Provide an accelerated implementation of aegis128 by wiring up the
    SIMD hooks in the generic driver to an implementation based on NEON
    intrinsics, which can be compiled to both ARM and arm64 code.

    This results in a performance of 2.2 cycles per byte on Cortex-A53,
    which is a performance increase of ~11x compared to the generic
    code.

    Reviewed-by: Ondrej Mosnacek
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     

02 Aug, 2019

1 commit

  • This reverts commit ecc8bc81f2fb3976737ef312f824ba6053aa3590
    ("crypto: aegis128 - provide a SIMD implementation based on NEON
    intrinsics") and commit 7cdc0ddbf74a19cecb2f0e9efa2cae9d3c665189
    ("crypto: aegis128 - add support for SIMD acceleration").

    They cause compile errors on platforms other than ARM because
    the mechanism to selectively compile the SIMD code is broken.

    Repoted-by: Heiko Carstens
    Reported-by: Stephen Rothwell
    Signed-off-by: Herbert Xu

    Herbert Xu
     

26 Jul, 2019

1 commit

  • Provide an accelerated implementation of aegis128 by wiring up the
    SIMD hooks in the generic driver to an implementation based on NEON
    intrinsics, which can be compiled to both ARM and arm64 code.

    This results in a performance of 2.2 cycles per byte on Cortex-A53,
    which is a performance increase of ~11x compared to the generic
    code.

    Reviewed-by: Ondrej Mosnacek
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Herbert Xu

    Ard Biesheuvel