16 Apr, 2015

1 commit

  • Pull crypto update from Herbert Xu:
    "Here is the crypto update for 4.1:

    New interfaces:
    - user-space interface for AEAD
    - user-space interface for RNG (i.e., pseudo RNG; a usage sketch follows this entry)

    New hashes:
    - ARMv8 SHA1/256
    - ARMv8 AES
    - ARMv8 GHASH
    - ARM assembler and NEON SHA256
    - MIPS OCTEON SHA1/256/512
    - MIPS img-hash SHA1/256 and MD5
    - Power 8 VMX AES/CBC/CTR/GHASH
    - PPC assembler AES, SHA1/256 and MD5
    - Broadcom IPROC RNG driver

    Cleanups/fixes:
    - prevent internal helper algos from being exposed to user-space
    - merge common code from assembly/C SHA implementations
    - misc fixes"

    * git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (169 commits)
    crypto: arm - workaround for building with old binutils
    crypto: arm/sha256 - avoid sha256 code on ARMv7-M
    crypto: x86/sha512_ssse3 - move SHA-384/512 SSSE3 implementation to base layer
    crypto: x86/sha256_ssse3 - move SHA-224/256 SSSE3 implementation to base layer
    crypto: x86/sha1_ssse3 - move SHA-1 SSSE3 implementation to base layer
    crypto: arm64/sha2-ce - move SHA-224/256 ARMv8 implementation to base layer
    crypto: arm64/sha1-ce - move SHA-1 ARMv8 implementation to base layer
    crypto: arm/sha2-ce - move SHA-224/256 ARMv8 implementation to base layer
    crypto: arm/sha256 - move SHA-224/256 ASM/NEON implementation to base layer
    crypto: arm/sha1-ce - move SHA-1 ARMv8 implementation to base layer
    crypto: arm/sha1_neon - move SHA-1 NEON implementation to base layer
    crypto: arm/sha1 - move SHA-1 ARM asm implementation to base layer
    crypto: sha512-generic - move to generic glue implementation
    crypto: sha256-generic - move to generic glue implementation
    crypto: sha1-generic - move to generic glue implementation
    crypto: sha512 - implement base layer for SHA-512
    crypto: sha256 - implement base layer for SHA-256
    crypto: sha1 - implement base layer for SHA-1
    crypto: api - remove instance when test failed
    crypto: api - Move alg ref count init to crypto_check_alg
    ...

    Linus Torvalds
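
    The RNG side of the new AF_ALG interface can be exercised entirely from
    user space. A minimal sketch follows; the "stdrng" algorithm name and
    the bare-bones error handling are illustrative assumptions, not part of
    the pull message:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/if_alg.h>

    int main(void)
    {
            struct sockaddr_alg sa = {
                    .salg_family = AF_ALG,
                    .salg_type   = "rng",    /* the new RNG interface type */
                    .salg_name   = "stdrng", /* DRBG-backed pseudo RNG */
            };
            unsigned char buf[16];
            int tfm, op;

            tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
            if (tfm < 0 || bind(tfm, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                    return 1;

            op = accept(tfm, NULL, 0);      /* one fd per operation stream */
            if (op < 0 || read(op, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                    return 1;

            for (unsigned i = 0; i < sizeof(buf); i++)
                    printf("%02x", buf[i]); /* 16 random bytes as hex */
            putchar('\n');
            close(op);
            close(tfm);
            return 0;
    }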
     

13 Apr, 2015

2 commits

  • Old versions of binutils (before 2.23) do not yet understand the
    crypto-neon-fp-armv8 fpu instructions, and an attempt to build these
    files results in a build failure:

    arch/arm/crypto/aes-ce-core.S:133: Error: selected processor does not support ARM mode `vld1.8 {q10-q11},[ip]!'
    arch/arm/crypto/aes-ce-core.S:133: Error: bad instruction `aese.8 q0,q8'
    arch/arm/crypto/aes-ce-core.S:133: Error: bad instruction `aesmc.8 q0,q0'
    arch/arm/crypto/aes-ce-core.S:133: Error: bad instruction `aese.8 q0,q9'
    arch/arm/crypto/aes-ce-core.S:133: Error: bad instruction `aesmc.8 q0,q0'

    Since the affected versions are still in widespread use, and this breaks
    'allmodconfig' builds, we should try to at least get a successful kernel
    build. Unfortunately, I could not come up with a way to make the Kconfig
    symbol depend on the binutils version, which would be the nicest solution.

    Instead, this patch uses the 'as-instr' Kbuild macro to find out whether
    the support is present in the assembler, and otherwise emits a non-fatal
    warning indicating which selected modules could not be built.

    Signed-off-by: Arnd Bergmann
    Link: http://storage.kernelci.org/next/next-20150410/arm-allmodconfig/build.log
    Fixes: 864cbeed4ab22d ("crypto: arm - add support for SHA1 using ARMv8 Crypto Instructions")
    [ard.biesheuvel:
    - omit modules entirely instead of building empty ones if binutils is too old
    - update commit log accordingly]
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     
  • The sha256 assembly implementation can deal with all architecture levels
    from ARMv4 to ARMv7-A, but not with ARMv7-M. Enabling it in an
    ARMv7-M kernel results in this build failure:

    arm-linux-gnueabi-ld: error: arch/arm/crypto/sha256_glue.o: Conflicting architecture profiles M/A
    arm-linux-gnueabi-ld: failed to merge target specific data of file arch/arm/crypto/sha256_glue.o

    This adds a Kconfig dependency to prevent the code from being enabled
    for ARMv7-M.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Herbert Xu

    Arnd Bergmann
     

03 Apr, 2015

1 commit

  • Add Andy Polyakov's optimized assembly and NEON implementations for
    SHA-256/224.

    The sha256-armv4.pl script for generating the assembly code is from
    OpenSSL commit 51f8d095562f36cdaa6893597b5c609e943b0565.

    Compared to sha256-generic these implementations have the following
    tcrypt speed improvements on Motorola Nexus 6 (Snapdragon 805):

    bs    b/u   sha256-neon  sha256-asm
    16    16    x1.32        x1.19
    64    16    x1.27        x1.15
    64    64    x1.36        x1.20
    256   16    x1.22        x1.11
    256   64    x1.36        x1.19
    256   256   x1.59        x1.23
    1024  16    x1.21        x1.10
    1024  256   x1.65        x1.23
    1024  1024  x1.76        x1.25
    2048  16    x1.21        x1.10
    2048  256   x1.66        x1.23
    2048  1024  x1.78        x1.25
    2048  2048  x1.79        x1.25
    4096  16    x1.20        x1.09
    4096  256   x1.66        x1.23
    4096  1024  x1.79        x1.26
    4096  4096  x1.82        x1.26
    8192  16    x1.20        x1.09
    8192  256   x1.67        x1.23
    8192  1024  x1.80        x1.26
    8192  4096  x1.85        x1.28
    8192  8192  x1.85        x1.27

    Where bs refers to block size and b/u to bytes per update. A hedged
    sketch of how a glue module can register both of these variants follows
    this entry.

    Signed-off-by: Sami Tolvanen
    Cc: Andy Polyakov
    Signed-off-by: Herbert Xu

    Sami Tolvanen
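
    As a hedged illustration of how a glue module can expose both of these
    variants, the sketch below registers the plain assembler transform
    unconditionally and the NEON one only when the CPU advertises NEON at
    runtime. The two shash_alg descriptors are hypothetical stand-ins, not
    taken from the commit:

    #include <linux/module.h>
    #include <crypto/internal/hash.h>
    #include <asm/neon.h>

    /* hypothetical descriptors wrapping the asm and NEON block functions */
    extern struct shash_alg sha256_asm_alg;
    extern struct shash_alg sha256_neon_alg;

    static int __init sha256_arm_mod_init(void)
    {
            int ret = crypto_register_shash(&sha256_asm_alg);

            if (ret)
                    return ret;

            if (cpu_has_neon()) {           /* runtime ELF-hwcap check */
                    ret = crypto_register_shash(&sha256_neon_alg);
                    if (ret)
                            crypto_unregister_shash(&sha256_asm_alg);
            }
            return ret;
    }
    module_init(sha256_arm_mod_init);

    MODULE_LICENSE("GPL");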
     

02 Mar, 2015

1 commit

  • This updates the bit sliced AES module to the latest version in the
    upstream OpenSSL repository (e620e5ae37bc). This is needed to fix a
    bug in the XTS decryption path, where data chunked in a certain way
    could trigger the ciphertext stealing code, which is not supposed to
    be active in the kernel build. (The kernel implementation of XTS only
    supports inputs that are a multiple of the 16-byte AES block size,
    whereas the conformant OpenSSL implementation supports inputs of
    arbitrary size by applying ciphertext stealing.) This is fixed in
    the upstream version by adding the missing #ifndef XTS_CHAIN_TWEAK
    around the offending instructions.

    The upstream code also contains the change applied by Russell to
    build the code unconditionally, i.e., even if __LINUX_ARM_ARCH__ < 7,
    but implemented slightly differently.

    Cc: stable@vger.kernel.org
    Fixes: e4e7f10bfc40 ("ARM: add support for bit sliced AES using NEON instructions")
    Reported-by: Adrian Kotelba
    Signed-off-by: Ard Biesheuvel
    Tested-by: Milan Broz
    Signed-off-by: Herbert Xu

    Ard Biesheuvel
     

02 Dec, 2014

1 commit

  • Memset on a local variable may be removed when it is called just before
    the variable goes out of scope, because the compiler treats it as a
    dead store. Using memzero_explicit defeats this optimization; a
    before/after sketch follows this entry. A simplified version of the
    semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    // <smpl>
    @@
    identifier x;
    type T;
    @@

    {
    ... when any
    T x[...];
    ... when any
        when exists
    - memset
    + memzero_explicit
      (x,
    -0,
      ...)
    ... when != x
        when strict
    }
    // </smpl>

    This change was suggested by Daniel Borkmann

    Signed-off-by: Julia Lawall
    Acked-by: Ard Biesheuvel
    Signed-off-by: Herbert Xu

    Julia Lawall
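
    For illustration, the before/after shape of the transformation on a
    hypothetical key-handling helper (function and buffer names invented):

    #include <linux/types.h>
    #include <linux/string.h>

    static void derive_subkey(const u8 *key)
    {
            u8 tmp[32];

            memcpy(tmp, key, sizeof(tmp));
            /* ... use tmp to derive key material ... */

            /* memset(tmp, 0, sizeof(tmp)) here is a dead store: tmp goes
             * out of scope immediately, so the compiler may elide it.
             * memzero_explicit() is guaranteed to survive optimization. */
            memzero_explicit(tmp, sizeof(tmp));
    }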
     

06 Aug, 2014

1 commit

  • Pull ARM updates from Russell King:
    "Included in this update:

    - perf updates from Will Deacon:

    The main changes are callchain stability fixes from Jean Pihet and
    event mapping and PMU name rework from Mark Rutland

    The latter is preparatory work for enabling some code re-use with
    arm64 in the future.

    - updates for nommu from Uwe Kleine-König:

    Two different fixes for the same problem making some ARM nommu
    configurations not boot since 3.6-rc1. The problem is that
    user_addr_max returned the biggest available RAM address which
    makes some copy_from_user variants fail to read from XIP memory.

    - deprecate legacy OMAP DMA API, in preparation for its removal.

    The popular drivers have been converted over, leaving a very small
    number of rarely used drivers, which hopefully can be converted
    during the next cycle with a bit more visibility (and hopefully
    people popping out of the woodwork to help test)

    - more tweaks for BE systems, particularly with the kernel image
    format. In connection with this, I've cleaned up the way we
    generate the linker script for the decompressor.

    - removal of hard-coded assumptions of the kernel stack size, making
    everywhere depend on the value of THREAD_SIZE_ORDER.

    - MCPM updates from Nicolas Pitre.

    - Make it easier for proper CPU part number checks (which should
    always include the vendor field).

    - Assembly code optimisation - use the "bx" instruction when
    returning from a function on ARMv6+ rather than "mov pc, reg".

    - Save the last kernel misaligned fault location and report it via
    the procfs alignment file.

    - Clean up the way we create the initial stack frame, which is a
    repeated pattern in several different locations.

    - Support for 8-byte get_user(), needed for some DRM implementations
    (a brief sketch follows this entry).

    - mcs locking from Will Deacon.

    - Save and restore a few more Cortex-A9 registers (for errata
    workarounds)

    - Fix various aspects of the SWP emulation, and the ELF hwcap for the
    SWP instruction.

    - Update LPAE logic for pte_write and pmd_write to make it more
    correct.

    - Support for Broadcom Brahma15 CPU cores.

    - ARM assembly crypto updates from Ard Biesheuvel"

    * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (53 commits)
    ARM: add comments to the early page table remap code
    ARM: 8122/1: smp_scu: enable SCU standby support
    ARM: 8121/1: smp_scu: use macro for SCU enable bit
    ARM: 8120/1: crypto: sha512: add ARM NEON implementation
    ARM: 8119/1: crypto: sha1: add ARM NEON implementation
    ARM: 8118/1: crypto: sha1: make use of common SHA-1 structures
    ARM: 8113/1: remove remaining definitions of PLAT_PHYS_OFFSET from <mach/memory.h>
    ARM: 8111/1: Enable erratum 798181 for Broadcom Brahma-B15
    ARM: 8110/1: do CPU-specific init for Broadcom Brahma15 cores
    ARM: 8109/1: mm: Modify pte_write and pmd_write logic for LPAE
    ARM: 8108/1: mm: Introduce {pte,pmd}_isset and {pte,pmd}_isclear
    ARM: hwcap: disable HWCAP_SWP if the CPU advertises it has exclusives
    ARM: SWP emulation: only initialise on ARMv7 CPUs
    ARM: SWP emulation: always enable when SMP is enabled
    ARM: 8103/1: save/restore Cortex-A9 CP15 registers on suspend/resume
    ARM: 8098/1: mcs lock: implement wfe-based polling for MCS locking
    ARM: 8091/2: add get_user() support for 8 byte types
    ARM: 8097/1: unistd.h: relocate comments back to place
    ARM: 8096/1: Describe required sort order for textofs-y (TEXT_OFFSET)
    ARM: 8090/1: add revision info for PL310 errata 588369 and 727915
    ...

    Linus Torvalds
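
    As flagged in the list above, a brief hedged sketch of what the 8-byte
    get_user() support enables on 32-bit ARM (the helper and its caller
    are hypothetical):

    #include <linux/types.h>
    #include <linux/uaccess.h>

    /* fetch a 64-bit value from user space in a single get_user() call,
     * which ARM only supports after this series */
    static int read_user_u64(const u64 __user *ptr, u64 *out)
    {
            return get_user(*out, ptr); /* 0 on success, -EFAULT on fault */
    }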
     

02 Aug, 2014

3 commits

  • This patch adds ARM NEON assembly implementation of SHA-512 and SHA-384
    algorithms.

    tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:

    block-size  bytes/update  old-vs-new
    16          16            2.99x
    64          16            2.67x
    64          64            3.00x
    256         16            2.64x
    256         64            3.06x
    256         256           3.33x
    1024        16            2.53x
    1024        256           3.39x
    1024        1024          3.52x
    2048        16            2.50x
    2048        256           3.41x
    2048        1024          3.54x
    2048        2048          3.57x
    4096        16            2.49x
    4096        256           3.42x
    4096        1024          3.56x
    4096        4096          3.59x
    8192        16            2.48x
    8192        256           3.42x
    8192        1024          3.56x
    8192        4096          3.60x
    8192        8192          3.60x

    Acked-by: Ard Biesheuvel
    Tested-by: Ard Biesheuvel
    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Russell King

    Jussi Kivilinna
     
  • This patch adds ARM NEON assembly implementation of SHA-1 algorithm.

    tcrypt benchmark results on Cortex-A8, sha1-arm-asm vs sha1-neon-asm:

    block-size  bytes/update  old-vs-new
    16          16            1.04x
    64          16            1.02x
    64          64            1.05x
    256         16            1.03x
    256         64            1.04x
    256         256           1.30x
    1024        16            1.03x
    1024        256           1.36x
    1024        1024          1.52x
    2048        16            1.03x
    2048        256           1.39x
    2048        1024          1.55x
    2048        2048          1.59x
    4096        16            1.03x
    4096        256           1.40x
    4096        1024          1.57x
    4096        4096          1.62x
    8192        16            1.03x
    8192        256           1.40x
    8192        1024          1.58x
    8192        4096          1.63x
    8192        8192          1.63x

    Acked-by: Ard Biesheuvel
    Tested-by: Ard Biesheuvel
    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Russell King

    Jussi Kivilinna
     
  • Common SHA-1 structures are defined in <crypto/sha.h> for code sharing.

    This patch changes the SHA-1/ARM glue code to use these structures; a
    sketch of the shared state structure follows this entry.

    Acked-by: Ard Biesheuvel
    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Russell King

    Jussi Kivilinna
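
    The shared structures referred to above come from <crypto/sha.h>. As a
    sketch of what the glue code now reuses (layout paraphrased, with the
    size macros expanded in the comments):

    #include <linux/types.h>

    /* sketch of struct sha1_state from <crypto/sha.h> */
    struct sha1_state {
            u64 count;      /* bytes processed so far */
            u32 state[5];   /* h0..h4 working digest (20 bytes) */
            u8  buffer[64]; /* partial-block carry-over */
    };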
     

18 Jul, 2014

1 commit

  • ARMv6 and greater introduced a new instruction ("bx") which can be used
    to return from function calls. Recent CPUs perform better when the
    "bx lr" instruction is used rather than the "mov pc, lr" instruction,
    and the ARM architecture manual strongly recommends using this
    sequence (section A.4.1.1).

    We provide a new macro "ret" with all its variants for the condition
    code which will resolve to the appropriate instruction.

    Rather than doing this piecemeal, and missing some instances, change all
    the "mov pc" instances to use the new macro, with the exception of
    the "movs" instruction and the kprobes code. This allows us to detect
    the "mov pc, lr" case and fix it up - and also gives us the possibility
    of deploying this for other registers depending on the CPU selection.

    Reported-by: Will Deacon
    Tested-by: Stephen Warren # Tegra Jetson TK1
    Tested-by: Robert Jarzmik # mioa701_bootresume.S
    Tested-by: Andrew Lunn # Kirkwood
    Tested-by: Shawn Guo
    Tested-by: Tony Lindgren # OMAPs
    Tested-by: Gregory CLEMENT # Armada XP, 375, 385
    Acked-by: Sekhar Nori # DaVinci
    Acked-by: Christoffer Dall # kvm/hyp
    Acked-by: Haojian Zhuang # PXA3xx
    Acked-by: Stefano Stabellini # Xen
    Tested-by: Uwe Kleine-König # ARMv7M
    Tested-by: Simon Horman # Shmobile
    Signed-off-by: Russell King

    Russell King
     

05 Jan, 2014

1 commit

  • Building a multi-arch kernel results in:

    arch/arm/crypto/built-in.o: In function `aesbs_xts_decrypt':
    sha1_glue.c:(.text+0x15c8): undefined reference to `bsaes_xts_decrypt'
    arch/arm/crypto/built-in.o: In function `aesbs_xts_encrypt':
    sha1_glue.c:(.text+0x1664): undefined reference to `bsaes_xts_encrypt'
    arch/arm/crypto/built-in.o: In function `aesbs_ctr_encrypt':
    sha1_glue.c:(.text+0x184c): undefined reference to `bsaes_ctr32_encrypt_blocks'
    arch/arm/crypto/built-in.o: In function `aesbs_cbc_decrypt':
    sha1_glue.c:(.text+0x19b4): undefined reference to `bsaes_cbc_encrypt'

    This code is already runtime-conditional on NEON being supported, so
    there's no point compiling it out depending on the minimum build
    architecture.

    Acked-by: Ard Biesheuvel
    Signed-off-by: Russell King

    Russell King
     

05 Oct, 2013

1 commit

  • Bit sliced AES gives around 45% speedup on Cortex-A15 for encryption
    and around 25% for decryption. This implementation of the AES algorithm
    does not rely on any lookup tables so it is believed to be invulnerable
    to cache timing attacks.

    This algorithm processes up to 8 blocks in parallel in constant time. This
    means that it is not usable by chaining modes that are strictly sequential
    in nature, such as CBC encryption. CBC decryption, however, can benefit from
    this implementation and runs about 25% faster. The other chaining modes
    implemented in this module, XTS and CTR, can execute fully in parallel in
    both directions. (A short sketch contrasting the two CBC directions follows
    this entry.)

    The core code has been adopted from the OpenSSL project (in collaboration
    with the original author, on cc). For ease of maintenance, this version is
    identical to the upstream OpenSSL code, i.e., all modifications that were
    required to make it suitable for inclusion into the kernel have been made
    upstream. The original can be found here:

    http://git.openssl.org/gitweb/?p=openssl.git;a=commit;h=6f6a6130

    Note to integrators:
    While this implementation is significantly faster than the existing
    table-based ones (generic or ARM asm), especially in CTR mode, the
    effects on power efficiency are as yet unclear. This code does
    fundamentally more work, calculating values that the table-based code
    obtains by a simple lookup; only by doing all of that work in a SIMD
    fashion does it manage to perform better.

    Cc: Andy Polyakov
    Acked-by: Nicolas Pitre
    Signed-off-by: Ard Biesheuvel

    Ard Biesheuvel
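
    The CBC asymmetry mentioned above, sketched in plain C (the block type
    and single-block AES primitives are stand-ins, not kernel code):
    encryption must chain each ciphertext block into the next input, while
    decryption only reads ciphertext that already exists, so all the
    aes_decrypt() calls are independent and can run eight at a time.

    #include <stddef.h>

    typedef struct { unsigned char b[16]; } block;

    /* stand-in single-block primitives */
    block aes_encrypt(block in);
    block aes_decrypt(block in);
    block xor_block(block x, block y);

    void cbc_encrypt(block *c, const block *p, size_t n, block iv)
    {
            for (size_t i = 0; i < n; i++)  /* serial: c[i] needs c[i-1] */
                    c[i] = aes_encrypt(xor_block(p[i], i ? c[i - 1] : iv));
    }

    void cbc_decrypt(block *p, const block *c, size_t n, block iv)
    {
            /* every aes_decrypt() is independent: 8 can run at once */
            for (size_t i = 0; i < n; i++)
                    p[i] = xor_block(aes_decrypt(c[i]), i ? c[i - 1] : iv);
    }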
     

22 Sep, 2013

1 commit

  • Patch 638591c enabled building the AES assembler code in Thumb2 mode.
    However, this code used arithmetic involving the PC rather than adr{l}
    instructions to generate PC-relative references to the lookup tables,
    and this needs to take into account the different PC offset when
    running in Thumb mode (the PC reads as the current instruction address
    plus 8 in ARM state, but plus 4 in Thumb state).

    Signed-off-by: Ard Biesheuvel
    Acked-by: Nicolas Pitre
    Cc: stable@vger.kernel.org
    Signed-off-by: Russell King

    Ard Biesheuvel
     

13 Jan, 2013

1 commit

  • This patch fixes aes-armv4.S and sha1-armv4-large.S to work
    natively in Thumb. This allows ARM/Thumb interworking workarounds
    to be removed.

    I also take the opportunity to convert some explicit assembler
    directives for exported functions to the standard
    ENTRY()/ENDPROC().

    For the code itself:

    * In sha1_block_data_order, use of TEQ with sp is deprecated in
    ARMv7 and not supported in Thumb. For the branches back to
    .L_00_15 and .L_40_59, the TEQ is converted to a CMP, under the
    assumption that clobbering the C flag here will not cause
    incorrect behaviour.

    For the first branch back to .L_20_39_or_60_79 the C flag is
    important, so sp is moved temporarily into another register so
    that TEQ can be used for the comparison.

    * In the AES code, most forms of register-indexed addressing with
    shifts and rotates are not permitted for loads and stores in
    Thumb, so the address calculation is done using a separate
    instruction for the Thumb case.

    The resulting code is unlikely to be optimally scheduled, but it
    should not have a large impact given the overall size of the code.
    I haven't run any benchmarks.

    Signed-off-by: Dave Martin
    Tested-by: David McCullough (ARM only)
    Acked-by: David McCullough
    Acked-by: Nicolas Pitre
    Signed-off-by: Russell King

    Dave Martin
     

07 Sep, 2012

1 commit

  • Add assembler versions of AES and SHA1 for ARM platforms. This has provided
    up to a 50% improvement in IPsec/TCP throughput for tunnels using AES128/SHA1.

    Platform  CPU speed  Endian  Before (bps)  After (bps)  Improvement
    IXP425    533 MHz    big     11217042      15566294     ~38%
    KS8695    166 MHz    little  3828549       5795373      ~51%

    Signed-off-by: David McCullough
    Signed-off-by: Herbert Xu

    David McCullough