21 Feb, 2012

3 commits

  • commit f2ea0f5f04c97b48c88edccba52b0682fbe45087 upstream.

    Use standard ror64() instead of hand-written.
    There is no standard ror64, so create it.

    The difference is that the shift value is "unsigned int" instead of
    uint64_t (for which there is no reason). With this change gcc starts to
    emit native ROR instructions, which for some reason it does not do
    currently. This should make the code faster.

    Patch survives in-tree crypto test and ping flood with hmac(sha512) on.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Herbert Xu
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
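
    A minimal user-space sketch of the rotate helper described above. The
    "unsigned int" shift type comes from the commit message; the masking of
    the shift count is an assumption added here so that the sketch stays well
    defined for a shift of 0 or 64 in portable C, and is not claimed to match
    the in-tree helper.

```c
#include <stdint.h>

/*
 * Rotate a 64-bit word right by `shift` bits.
 * The shift count is a plain unsigned int, as the commit message describes.
 * Masking with 63 is an assumption of this sketch: it avoids an undefined
 * shift by 64 when shift is 0 or a multiple of 64.
 */
static inline uint64_t ror64(uint64_t word, unsigned int shift)
{
    shift &= 63;
    return (word >> shift) | (word << ((64 - shift) & 63));
}
```

    Compilers typically recognise this shift/or pattern and can lower it to a
    single rotate instruction, although whether they do depends on the
    compiler version and target.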
     
  • commit 3a92d687c8015860a19213e3c102cad6b722f83c upstream.

    Unfortunately in reducing W from 80 to 16 we ended up unrolling
    the loop twice. As gcc has issues dealing with 64-bit ops on
    i386 this means that we end up using even more stack space (>1K).

    This patch solves the W reduction by moving LOAD_OP/BLEND_OP
    into the loop itself, thus avoiding the need to duplicate it.

    While the stack space still isn't great (>0.5K) it is at least
    in the same ball park as the amount of stack used for our C sha1
    implementation.

    Note that this patch basically reverts to the original code so
    the diff looks bigger than it really is.

    Signed-off-by: Herbert Xu
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • commit 58d7d18b5268febb8b1391c6dffc8e2aaa751fcd upstream.

    The previous patch used the modulus operator over a power of 2
    unnecessarily which may produce suboptimal binary code. This
    patch changes them to binary ANDs instead.

    Signed-off-by: Herbert Xu
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
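
    The point about modulus versus binary AND can be illustrated with a small
    sketch. The function names are hypothetical. For a signed int, "i % 16"
    has to yield a negative result for negative i, so the compiler cannot
    always reduce it to a plain mask, whereas "i & 15" is a single AND. For
    the non-negative round numbers used here both give the same index.

```c
/* Index into a 16-entry circular array using the remainder operator. */
int index_mod(int i)
{
    return i % 16;
}

/* The same index computed with a bitwise AND against 16 - 1. */
int index_and(int i)
{
    return i & 15;
}

/* Return 1 if both forms agree for every round number in [0, rounds). */
int indexes_agree(int rounds)
{
    for (int i = 0; i < rounds; i++) {
        if (index_mod(i) != index_and(i))
            return 0;
    }
    return 1;
}
```

    The two forms differ only for negative values, which is why the
    substitution is safe for round numbers but not in general.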
     

04 Feb, 2012

2 commits

  • commit 51fc6dc8f948047364f7d42a4ed89b416c6cc0a3 upstream.

    For rounds 16--79, W[i] only depends on W[i - 2], W[i - 7], W[i - 15] and W[i - 16].
    Consequently, keeping the whole W[80] array on the stack is unnecessary;
    only 16 values are really needed.

    Using W[16] instead of W[80] greatly reduces stack usage
    (~750 bytes to ~340 bytes on x86_64).

    Line by line explanation:
    * BLEND_OP
    The array is "circular" now, so all indexes have to be taken modulo 16.
    The round number is non-negative, so the remainder operation should
    hold no surprises.

    * initial full message scheduling is trimmed to first 16 values which
    come from data block, the rest is calculated before it's needed.

    * original loop body is unrolled version of new SHA512_0_15 and
    SHA512_16_79 macros, unrolling was done to not do explicit variable
    renaming. Otherwise it's the very same code after preprocessing.
    See sha1_transform() code which does the same trick.

    Patch survives in-tree crypto test and original bugreport test
    (ping flood with hmac(sha512)).

    See FIPS 180-2 for SHA-512 definition
    http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Herbert Xu
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
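
    The dependency argument above can be checked with a small user-space
    sketch that builds the SHA-512 message schedule twice: once into a full
    80-word array and once through a 16-word circular window. The function
    names are hypothetical and this is not the kernel code; the s0/s1
    functions follow the sigma definitions in FIPS 180-2. The out[80] array in
    the circular variant exists only to record each round's word for
    comparison; a real transform would consume the word immediately.

```c
#include <stdint.h>
#include <string.h>

/* Rotate right; only called with shift counts in 1..63 below. */
static uint64_t rotr(uint64_t w, unsigned int s)
{
    return (w >> s) | (w << (64 - s));
}

/* SHA-512 small sigma functions (FIPS 180-2). */
static uint64_t s0(uint64_t x) { return rotr(x, 1) ^ rotr(x, 8) ^ (x >> 7); }
static uint64_t s1(uint64_t x) { return rotr(x, 19) ^ rotr(x, 61) ^ (x >> 6); }

/* Reference: keep all 80 schedule words. */
void schedule_full(const uint64_t in[16], uint64_t out[80])
{
    for (int t = 0; t < 16; t++)
        out[t] = in[t];
    for (int t = 16; t < 80; t++)
        out[t] = s1(out[t - 2]) + out[t - 7] + s0(out[t - 15]) + out[t - 16];
}

/* Circular: keep only the last 16 words, indexed modulo 16. */
void schedule_circular(const uint64_t in[16], uint64_t out[80])
{
    uint64_t W[16];

    for (int t = 0; t < 80; t++) {
        if (t < 16)
            W[t] = in[t];
        else
            /* W[t & 15] still holds W[t - 16] at this point. */
            W[t & 15] += s1(W[(t - 2) & 15]) + W[(t - 7) & 15]
                       + s0(W[(t - 15) & 15]);
        out[t] = W[t & 15];   /* word used in round t */
    }
}

/* Return 1 if both schedules produce identical words for all 80 rounds. */
int schedules_match(const uint64_t in[16])
{
    uint64_t a[80], b[80];

    schedule_full(in, a);
    schedule_circular(in, b);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

    The in-place "+=" works because slot t mod 16 contains W[t - 16] just
    before round t overwrites it, which is exactly the fourth term of the
    recurrence.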
     
  • commit 84e31fdb7c797a7303e0cc295cb9bc8b73fb872d upstream.

    commit f9e2bca6c22d75a289a349f869701214d63b5060
    aka "crypto: sha512 - Move message schedule W[80] to static percpu area"
    created global message schedule area.

    If sha512_update() is ever entered twice, the hash will be silently
    calculated incorrectly.

    Probably the easiest way to notice incorrect hashes being calculated is
    to run 2 ping floods over AH with hmac(sha512):

    #!/usr/sbin/setkey -f
    flush;
    spdflush;
    add IP1 IP2 ah 25 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000025;
    add IP2 IP1 ah 52 -A hmac-sha512 0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000052;
    spdadd IP1 IP2 any -P out ipsec ah/transport//require;
    spdadd IP2 IP1 any -P in ipsec ah/transport//require;

    XfrmInStateProtoError will start ticking with -EBADMSG being returned
    from ah_input(). This never happens with, say, hmac(sha1).

    With the patch applied (on BOTH sides), XfrmInStateProtoError does not
    tick with multiple bidirectional ping flood streams, just as it does not
    tick with SHA-1.

    After this patch sha512_transform() will start using ~750 bytes of stack on x86_64.
    This is OK for simple loads; for anything heavier, stack reduction will be done
    separately.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Herbert Xu
    Signed-off-by: Greg Kroah-Hartman

    Alexey Dobriyan
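
    The failure mode described above, a shared scratch area being overwritten
    when the function is entered a second time, can be shown with a toy
    sketch. Everything here is hypothetical: the "digest" is a plain sum
    rather than SHA-512, and the second entry is simulated by a callback
    invoked in the middle of the computation. How re-entry actually arises in
    the kernel is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

#define NWORDS 16

typedef void (*hook_fn)(void);

/* Stand-in for a shared (global) message schedule area. */
static uint64_t shared_scratch[NWORDS];

static uint64_t sum_words(const uint64_t *w)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < NWORDS; i++)
        sum += w[i];
    return sum;
}

/* Uses the shared scratch area: a nested entry clobbers our data. */
uint64_t digest_shared(uint64_t seed, hook_fn hook)
{
    for (size_t i = 0; i < NWORDS; i++)
        shared_scratch[i] = seed + i;
    if (hook)
        hook();               /* simulated second entry */
    return sum_words(shared_scratch);
}

/* Uses a stack-local scratch area: each entry has its own copy. */
uint64_t digest_local(uint64_t seed, hook_fn hook)
{
    uint64_t w[NWORDS];

    for (size_t i = 0; i < NWORDS; i++)
        w[i] = seed + i;
    if (hook)
        hook();               /* simulated second entry */
    return sum_words(w);
}

/* Hooks that re-enter each variant with different input. */
void reenter_shared(void) { (void)digest_shared(1000, NULL); }
void reenter_local(void)  { (void)digest_local(1000, NULL); }
```

    With the shared buffer, the outer call silently returns the result for the
    inner call's data; with the stack-local buffer it does not, at the cost of
    the extra stack usage the commit message mentions.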
     

12 Nov, 2011

1 commit


11 Nov, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

02 Nov, 2011

2 commits

  • The list_empty case in crypto_alg_match() will return without calling
    up_read() on crypto_alg_sem. We could do the "goto out" routine, but the
    function will clearly do the right thing with that test simply removed.

    Signed-off-by: Jonathan Corbet
    Signed-off-by: Herbert Xu

    Jonathan Corbet
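
    The lock leak described above can be sketched in user space with a toy
    reader lock that is just a counter. All names here are hypothetical
    stand-ins and none of this is the kernel code. The buggy shape returns
    early on an empty list without releasing the lock; the fixed shape drops
    the test, since iterating over an empty list does nothing anyway.

```c
#include <stddef.h>
#include <string.h>

struct alg {
    const char *name;
    struct alg *next;
};

/* Toy stand-in for a reader lock: counts outstanding read holds. */
static int read_lock_depth;

void toy_down_read(void) { read_lock_depth++; }
void toy_up_read(void)   { read_lock_depth--; }
int  toy_lock_depth(void) { return read_lock_depth; }

/* Buggy shape: the empty-list shortcut skips toy_up_read(). */
const struct alg *match_buggy(const struct alg *head, const char *name)
{
    const struct alg *found = NULL;

    toy_down_read();
    if (head == NULL)
        return NULL;          /* lock is still held here */
    for (const struct alg *a = head; a != NULL; a = a->next) {
        if (strcmp(a->name, name) == 0) {
            found = a;
            break;
        }
    }
    toy_up_read();
    return found;
}

/* Fixed shape: no shortcut; the loop simply does not run when empty. */
const struct alg *match_fixed(const struct alg *head, const char *name)
{
    const struct alg *found = NULL;

    toy_down_read();
    for (const struct alg *a = head; a != NULL; a = a->next) {
        if (strcmp(a->name, name) == 0) {
            found = a;
            break;
        }
    }
    toy_up_read();
    return found;
}
```

    Removing the test rather than adding a "goto out" keeps a single exit
    path, so the release cannot be skipped.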
     
  • * git://github.com/herbertx/crypto: (48 commits)
    crypto: user - Depend on NET instead of selecting it
    crypto: user - Add dependency on NET
    crypto: talitos - handle descriptor not found in error path
    crypto: user - Initialise match in crypto_alg_match
    crypto: testmgr - add twofish tests
    crypto: testmgr - add blowfish test-vectors
    crypto: Make hifn_795x build depend on !ARCH_DMA_ADDR_T_64BIT
    crypto: twofish-x86_64-3way - fix ctr blocksize to 1
    crypto: blowfish-x86_64 - fix ctr blocksize to 1
    crypto: whirlpool - count rounds from 0
    crypto: Add userspace report for compress type algorithms
    crypto: Add userspace report for cipher type algorithms
    crypto: Add userspace report for rng type algorithms
    crypto: Add userspace report for pcompress type algorithms
    crypto: Add userspace report for nivaead type algorithms
    crypto: Add userspace report for aead type algorithms
    crypto: Add userspace report for givcipher type algorithms
    crypto: Add userspace report for ablkcipher type algorithms
    crypto: Add userspace report for blkcipher type algorithms
    crypto: Add userspace report for ahash type algorithms
    ...

    Linus Torvalds
     

01 Nov, 2011

2 commits


26 Oct, 2011

1 commit


21 Oct, 2011

24 commits


22 Sep, 2011

3 commits

  • Patch adds an x86_64 assembly implementation of blowfish. Two sets of
    assembler functions are provided. The first set is the regular 'one block
    at a time' encrypt/decrypt functions. The second is 'four blocks at a
    time' functions that gain performance on out-of-order CPUs. On in-order
    CPUs the 4-way functions should perform the same as the 1-way functions.

    Summary of the tcrypt benchmarks:

    Blowfish assembler vs blowfish C (256bit 8kb block ECB)
    encrypt: 2.2x speed
    decrypt: 2.3x speed

    Blowfish assembler vs blowfish C (256bit 8kb block CBC)
    encrypt: 1.12x speed
    decrypt: 2.5x speed

    Blowfish assembler vs blowfish C (256bit 8kb block CTR)
    encrypt: 2.5x speed

    Full output:
    http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-blowfish-asm-x86_64.txt
    http://koti.mbnet.fi/axh/kernel/crypto/tcrypt-speed-blowfish-c-x86_64.txt

    Tests were run on:
    vendor_id : AuthenticAMD
    cpu family : 16
    model : 10
    model name : AMD Phenom(tm) II X6 1055T Processor
    stepping : 0

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Add ctr(blowfish) speed test to obtain results for the blowfish x86_64
    assembly patch.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna
     
  • Rename blowfish to blowfish_generic so that assembler versions of blowfish
    cipher can autoload. Module alias 'blowfish' is added.

    Also fix checkpatch warnings.

    Signed-off-by: Jussi Kivilinna
    Signed-off-by: Herbert Xu

    Jussi Kivilinna