| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
"API:
- Replace crypto_get_default_rng with crypto_stdrng_get_bytes
- Remove simd skcipher support
- Allow algorithm types to be disabled when CRYPTO_SELFTESTS is off
Algorithms:
- Remove CPU-based des/3des acceleration
- Add test vectors for authenc(hmac(md5),cbc({aes,des})) and
authenc(hmac({md5,sha1,sha224,sha256,sha384,sha512}),rfc3686(ctr(aes)))
- Replace spin lock with mutex in jitterentropy
Drivers:
- Add authenc algorithms to safexcel
- Add support for zstd in qat
- Add wireless mode support for QAT GEN6
- Add anti-rollback support for QAT GEN6
- Add support for ctr(aes), gcm(aes), and ccm(aes) in dthev2"
* tag 'v7.1-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (129 commits)
crypto: af_alg - use sock_kmemdup in alg_setkey_by_key_serial
crypto: vmx - remove CRYPTO_DEV_VMX from Kconfig
crypto: omap - convert reqctx buffer to fixed-size array
crypto: atmel-sha204a - add Thorsten Blum as maintainer
crypto: atmel-ecc - add Thorsten Blum as maintainer
crypto: qat - fix IRQ cleanup on 6xxx probe failure
crypto: geniv - Remove unused spinlock from struct aead_geniv_ctx
crypto: qce - simplify qce_xts_swapiv()
crypto: hisilicon - Fix dma_unmap_single() direction
crypto: talitos - rename first/last to first_desc/last_desc
crypto: talitos - fix SEC1 32k ahash request limitation
crypto: jitterentropy - replace long-held spinlock with mutex
crypto: hisilicon - remove unused and non-public APIs for qm and sec
crypto: hisilicon/qm - drop redundant variable initialization
crypto: hisilicon/qm - remove else after return
crypto: hisilicon/qm - add const qualifier to info_name in struct qm_cmd_dump_item
crypto: hisilicon - fix the format string type error
crypto: ccree - fix a memory leak in cc_mac_digest()
crypto: qat - add support for zstd
crypto: qat - use swab32 macro
...
|
|
Since DES and Triple DES are obsolete, there is very little point in
maintining architecture-optimized code for them. Remove it.
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Instead of exposing the x86-optimized SM3 code via an x86-specific
crypto_shash algorithm, instead just implement the sm3_blocks() library
function. This is much simpler, it makes the SM3 library functions be
x86-optimized, and it fixes the longstanding issue where the
x86-optimized SM3 code was disabled by default. SM3 still remains
available through crypto_shash, but individual architectures no longer
need to handle it.
Tweak the prototype of sm3_transform_avx() to match what the library
expects, including changing the block count to size_t. Note that the
assembly code actually already treated this argument as size_t.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260321040935.410034-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Remove the "ghash-pclmulqdqni" crypto_shash algorithm. Move the
corresponding assembly code into lib/crypto/, and wire it up to the
GHASH library.
This makes the GHASH library be optimized with x86's carryless
multiplication instructions. It also greatly reduces the amount of
x86-specific glue code that is needed, and it fixes the issue where this
GHASH optimization was disabled by default.
Rename and adjust the prototypes of the assembly functions to make them
fit better with the library. Remove the byte-swaps (pshufb
instructions) that are no longer necessary because the library keeps the
accumulator in POLYVAL format rather than GHASH format.
Rename clmul_ghash_mul() to polyval_mul_pclmul() to reflect that it
really does a POLYVAL style multiplication. Wire it up to both
ghash_mul_arch() and polyval_mul_arch().
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260319061723.1140720-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Migrate the x86_64 implementations of NH into lib/crypto/. This makes
the nh() function be optimized on x86_64 kernels.
Note: this temporarily makes the adiantum template not utilize the
x86_64 optimized NH code. This is resolved in a later commit that
converts the adiantum template to use nh() instead of "nhpoly1305".
Link: https://lore.kernel.org/r/20251211011846.8179-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull AES-GCM optimizations from Eric Biggers:
"More optimizations and cleanups for the x86_64 AES-GCM code:
- Add a VAES+AVX2 optimized implementation of AES-GCM. This is very
helpful on CPUs that have VAES but not AVX512, such as AMD Zen 3.
- Make the VAES+AVX512 optimized implementation of AES-GCM handle
large amounts of associated data efficiently.
- Remove the "avx10_256" implementation of AES-GCM. It's superseded
by the VAES+AVX2 optimized implementation.
- Rename the "avx10_512" implementation to "avx512"
Overall, this fills in a gap where AES-GCM wasn't fully optimized on
some recent CPUs. It also drops code that won't be as useful as
initially expected due to AVX10/256 being dropped from the AVX10 spec"
* tag 'aes-gcm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
crypto: x86/aes-gcm-vaes-avx2 - initialize full %rax return register
crypto: x86/aes-gcm - optimize long AAD processing with AVX512
crypto: x86/aes-gcm - optimize AVX512 precomputation of H^2 from H^1
crypto: x86/aes-gcm - revise some comments in AVX512 code
crypto: x86/aes-gcm - reorder AVX512 precompute and aad_update functions
crypto: x86/aes-gcm - clean up AVX512 code to assume 512-bit vectors
crypto: x86/aes-gcm - rename avx10 and avx10_512 to avx512
crypto: x86/aes-gcm - remove VAES+AVX10/256 optimized code
crypto: x86/aes-gcm - add VAES+AVX2 optimized code
|
|
Migrate the x86_64 implementation of POLYVAL into lib/crypto/, wiring it
up to the POLYVAL library interface. This makes the POLYVAL library be
properly optimized on x86_64.
This drops the x86_64 optimizations of polyval in the crypto_shash API.
That's fine, since polyval will be removed from crypto_shash entirely
since it is unneeded there. But even if it comes back, the crypto_shash
API could just be implemented on top of the library API, as usual.
Adjust the names and prototypes of the assembly functions to align more
closely with the rest of the library code.
Also replace a movaps instruction with movups to remove the assumption
that the key struct is 16-byte aligned. Users can still align the key
if they want (and at least in this case, movups is just as fast as
movaps), but it's inconvenient to require it.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251109234726.638437-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
With the "avx10_256" code removed and the AVX10 specification having
been changed to basically just be a re-packaged AVX512, the "avx10_512"
name no longer makes sense. Replace it with "avx512".
While doing this, also add the "vaes_" prefix in places that didn't
already have it. The result is that the two VAES optimized
implementations are consistently called vaes_avx2 and vaes_avx512.
(Also drop the "-x86_64" part of the assembly filename, to keep it from
getting too long. There's no 32-bit version of this code, and the fact
that it's 64-bit is unremarkable; it's the norm for new code.)
Note: although aes_gcm_aad_update_vaes_avx512() (previously called
aes_gcm_aad_update_vaes_avx10()) uses at most 256-bit vectors, it still
depends on the AVX512 CPU feature. So its new name is still accurate.
Also, a later commit will make it sometimes use 512-bit vectors anyway.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251002023117.37504-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Add an implementation of AES-GCM that uses 256-bit vectors and the
following CPU features: Vector AES (VAES), Vector Carryless
Multiplication (VPCLMULQDQ), and AVX2.
It doesn't require AVX512. So unlike the existing VAES+AVX512 code, it
works on CPUs that support VAES but not AVX512, specifically:
- AMD Zen 3, both client and server
- Intel Alder Lake, Raptor Lake, Meteor Lake, Arrow Lake, and Lunar
Lake. (These are client CPUs.)
- Intel Sierra Forest. (This is a server CPU.)
On these CPUs, this VAES+AVX2 code is much faster than the existing
AES-NI code. The AES-NI code uses only 128-bit vectors.
These CPUs are widely deployed, making VAES+AVX2 code worthwhile even
though hopefully future x86_64 CPUs will uniformly support AVX512.
This implementation will also serve as the fallback 256-bit
implementation for older Intel CPUs (Ice Lake and Tiger Lake) that
support AVX512 but downclock too eagerly when 512-bit vectors are used.
Currently, the VAES+AVX10/256 implementation serves that purpose. A
later commit will remove that and just use the VAES+AVX2 one. (Note
that AES-XTS and AES-CTR already successfully use this approach.)
I originally wrote this AES-GCM implementation for BoringSSL. It's been
in BoringSSL for a while now, including in Chromium. This is a port of
it to the Linux kernel. The main changes in the Linux version include:
- Port from "perlasm" to a standard .S file.
- Align all assembly functions with what aesni-intel_glue.c expects,
including adding support for lengths not a multiple of 16 bytes.
- Rework the en/decryption of the final 1 to 127 bytes.
This commit increases AES-256-GCM throughput on AMD Milan (Zen 3) by up
to 74%, as shown by the following tables:
Table 1: AES-256-GCM encryption throughput change,
CPU vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3) | 67% | 59% | 61% | 39% | 23% | 27% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3) | 14% | 12% | 7% | 7% | 0% |
Table 2: AES-256-GCM decryption throughput change,
CPU vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3) | 74% | 65% | 65% | 44% | 23% | 26% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
AMD Milan (Zen 3) | 12% | 11% | 3% | 2% | -3% |
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20251002023117.37504-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- Simplify inline asm flag output operands now that the minimum
compiler version supports the =@ccCOND syntax
- Remove a bunch of AS_* Kconfig symbols which detect assembler support
for various instruction mnemonics now that the minimum assembler
version supports them all
- The usual cleanups all over the place
* tag 'x86_cleanups_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm: Remove code depending on __GCC_ASM_FLAG_OUTPUTS__
x86/sgx: Use ENCLS mnemonic in <kernel/cpu/sgx/encls.h>
x86/mtrr: Remove license boilerplate text with bad FSF address
x86/asm: Use RDPKRU and WRPKRU mnemonics in <asm/special_insns.h>
x86/idle: Use MONITORX and MWAITX mnemonics in <asm/mwait.h>
x86/entry/fred: Push __KERNEL_CS directly
x86/kconfig: Remove CONFIG_AS_AVX512
crypto: x86 - Remove CONFIG_AS_VPCLMULQDQ
crypto: X86 - Remove CONFIG_AS_VAES
crypto: x86 - Remove CONFIG_AS_GFNI
x86/kconfig: Drop unused and needless config X86_64_SMP
|
|
Reorganize the Curve25519 library code:
- Build a single libcurve25519 module, instead of up to three modules:
libcurve25519, libcurve25519-generic, and an arch-specific module.
- Move the arch-specific Curve25519 code from arch/$(SRCARCH)/crypto/ to
lib/crypto/$(SRCARCH)/. Centralize the build rules into
lib/crypto/Makefile and lib/crypto/Kconfig.
- Include the arch-specific code directly in lib/crypto/curve25519.c via
a header, rather than using a separate .c file.
- Eliminate the entanglement with CRYPTO. CRYPTO_LIB_CURVE25519 no
longer selects CRYPTO, and the arch-specific Curve25519 code no longer
depends on CRYPTO.
This brings Curve25519 in line with the latest conventions for
lib/crypto/, used by other algorithms. The exception is that I kept the
generic code in separate translation units for now. (Some of the
function names collide between the x86 and generic Curve25519 code. And
the Curve25519 functions are very long anyway, so inlining doesn't
matter as much for Curve25519 as it does for some other algorithms.)
Link: https://lore.kernel.org/r/20250906213523.84915-11-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Current minimum required version of binutils is 2.30, which supports VPCLMULQDQ
instruction mnemonics.
Remove check for assembler support of VPCLMULQDQ instructions and all relevant
macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/20250819085855.333380-3-ubizjak@gmail.com
|
|
Current minimum required version of binutils is 2.30, which supports VAES
instruction mnemonics.
Remove check for assembler support of VAES instructions and all relevant macros
for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/20250819085855.333380-2-ubizjak@gmail.com
|
|
Instead of exposing the x86-optimized SHA-1 code via x86-specific
crypto_shash algorithms, instead just implement the sha1_blocks()
library function. This is much simpler, it makes the SHA-1 library
functions be x86-optimized, and it fixes the longstanding issue where
the x86-optimized SHA-1 code was disabled by default. SHA-1 still
remains available through crypto_shash, but individual architectures no
longer need to handle it.
To match sha1_blocks(), change the type of the nblocks parameter of the
assembly functions from int to size_t. The assembly functions actually
already treated it as size_t.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250712232329.818226-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Instead of exposing the x86-optimized SHA-512 code via x86-specific
crypto_shash algorithms, instead just implement the sha512_blocks()
library function. This is much simpler, it makes the SHA-512 (and
SHA-384) library functions be x86-optimized, and it fixes the
longstanding issue where the x86-optimized SHA-512 code was disabled by
default. SHA-512 still remains available through crypto_shash, but
individual architectures no longer need to handle it.
To match sha512_blocks(), change the type of the nblocks parameter of
the assembly functions from int to size_t. The assembly functions
actually already treated it as size_t.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250630160320.2888-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Instead of providing crypto_shash algorithms for the arch-optimized
SHA-256 code, instead implement the SHA-256 library. This is much
simpler, it makes the SHA-256 library functions be arch-optimized, and
it fixes the longstanding issue where the arch-optimized SHA-256 was
disabled by default. SHA-256 still remains available through
crypto_shash, but individual architectures no longer need to handle it.
To match sha256_blocks_arch(), change the type of the nblocks parameter
of the assembly functions from int to size_t. The assembly functions
actually already treated it as size_t.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Continue disentangling the crypto library functions from the generic
crypto infrastructure by moving the x86 BLAKE2s, ChaCha, and Poly1305
library functions into a new directory arch/x86/lib/crypto/ that does
not depend on CRYPTO. This mirrors the distinction between crypto/ and
lib/crypto/.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Current minimum required version of binutils is 2.25,
which supports AVX-512 instruction mnemonics.
Remove check for assembler support of AVX-512 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Current minimum required version of binutils is 2.25,
which supports SHA-256 instruction mnemonics.
Remove check for assembler support of SHA-256 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Current minimum required version of binutils is 2.25,
which supports SHA-1 instruction mnemonics.
Remove check for assembler support of SHA-1 instructions
and all relevant macros for conditional compilation.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Delete aes_ctrby8_avx-x86_64.S and add a new assembly file
aes-ctr-avx-x86_64.S which follows a similar approach to
aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX,
VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just
AESNI+AVX. Wire it up to the crypto API accordingly.
This greatly improves the performance of AES-CTR and AES-XCTR on
VAES-capable CPUs, with the best case being AMD Zen 5 where an over 230%
increase in throughput is seen on long messages. Performance on
non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR
code (aesni_ctr_enc) is also kept as-is for now. There are some slight
regressions (less than 10%) on some short message lengths on some CPUs;
these are difficult to avoid, given how the previous code was so heavily
unrolled by message length, and they are not particularly important.
Detailed performance results are given in the tables below.
Both CTR and XCTR support is retained. The main loop remains
8-vector-wide, which differs from the 4-vector-wide main loops that are
used in the XTS and GCM code. A wider loop is appropriate for CTR and
XCTR since they have fewer other instructions (such as vpclmulqdq) to
interleave with the AES instructions.
Similar to what was the case for AES-GCM, the new assembly code also has
a much smaller binary size, as it fixes the excessive unrolling by data
length and key length present in the old code. Specifically, the new
assembly file compiles to about 9 KB of text vs. 28 KB for the old file.
This is despite 4x as many implementations being included.
The tables below show the detailed performance results. The tables show
percentage improvement in single-threaded throughput for repeated
encryption of the given message length; an increase from 6000 MB/s to
12000 MB/s would be listed as 100%. They were collected by directly
measuring the Linux crypto API performance using a custom kernel module.
The tested CPUs were all server processors from Google Compute Engine
except for Zen 5 which was a Ryzen 9 9950X desktop processor.
Table 1: AES-256-CTR throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5 | 232% | 203% | 212% | 143% | 71% | 95% |
Intel Emerald Rapids | 116% | 116% | 117% | 91% | 78% | 79% |
Intel Ice Lake | 109% | 103% | 107% | 81% | 54% | 56% |
AMD Zen 4 | 109% | 91% | 100% | 70% | 43% | 59% |
AMD Zen 3 | 92% | 78% | 87% | 57% | 32% | 43% |
AMD Zen 2 | 9% | 8% | 14% | 12% | 8% | 21% |
Intel Skylake | 7% | 7% | 8% | 5% | 3% | 8% |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5 | 57% | 39% | -9% | 7% | -7% |
Intel Emerald Rapids | 37% | 42% | -0% | 13% | -8% |
Intel Ice Lake | 39% | 30% | -1% | 14% | -9% |
AMD Zen 4 | 42% | 38% | -0% | 18% | -3% |
AMD Zen 3 | 38% | 35% | 6% | 31% | 5% |
AMD Zen 2 | 24% | 23% | 5% | 30% | 3% |
Intel Skylake | 9% | 1% | -4% | 10% | -7% |
Table 2: AES-256-XCTR throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5 | 240% | 201% | 216% | 151% | 75% | 108% |
Intel Emerald Rapids | 100% | 99% | 102% | 91% | 94% | 104% |
Intel Ice Lake | 93% | 89% | 92% | 74% | 50% | 64% |
AMD Zen 4 | 86% | 75% | 83% | 60% | 41% | 52% |
AMD Zen 3 | 73% | 63% | 69% | 45% | 21% | 33% |
AMD Zen 2 | -2% | -2% | 2% | 3% | -1% | 11% |
Intel Skylake | -1% | -1% | 1% | 2% | -1% | 9% |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5 | 78% | 56% | -4% | 38% | -2% |
Intel Emerald Rapids | 61% | 55% | 4% | 32% | -5% |
Intel Ice Lake | 57% | 42% | 3% | 44% | -4% |
AMD Zen 4 | 35% | 28% | -1% | 17% | -3% |
AMD Zen 3 | 26% | 23% | -3% | 11% | -6% |
AMD Zen 2 | 13% | 24% | -1% | 14% | -3% |
Intel Skylake | 16% | 8% | -4% | 35% | -3% |
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it
up to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API too via
the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Move the x86 CRC32 assembly code into the lib directory and wire it up
to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API too via
the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations. This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.
The following summarizes the state before this patch:
- The aesni-intel module registered algorithms "generic-gcm-aesni" and
"rfc4106-gcm-aesni" with the crypto API that actually delegated to one
of three underlying implementations according to the CPU capabilities
detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.
- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
257 KB of binary. This massive binary size was not really
appropriate, and depending on the kconfig it could take up over 1% the
size of the entire vmlinux. The main loops did 8 blocks per
iteration. The AVX code minimized the use of carryless multiplication
whereas the AVX2 code did not. The "AVX2" code did not actually use
AVX2; the check for AVX2 was really a check for Intel Haswell or later
to detect support for fast carryless multiplication. The long source
length was caused by factors such as significant code duplication.
- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
of 1501 lines of source and 15 KB of binary. The main loops did 4
blocks per iteration and minimized the use of carryless multiplication
by using Karatsuba multiplication and a multiplication-less reduction.
- The assembly code was contributed in 2010-2013. Maintenance has been
sporadic and most design choices haven't been revisited.
- The assembly function prototypes and the corresponding glue code were
separate from and were not consistent with the new VAES-AVX10 code I
recently added. The older code had several issues such as not
precomputing the GHASH key powers, which hurt performance.
This rewrite achieves the following goals:
- Much shorter source and binary sizes. The assembly source shrinks
from 4300 lines to 1130 lines, and it produces about 9 KB of binary
instead of 272 KB. This is achieved via a better designed AES-GCM
implementation that doesn't excessively unroll the code and instead
prioritizes the parts that really matter. Sharing the C glue code
with the VAES-AVX10 implementations also saves 250 lines of C source.
- Improve performance on most (possibly all) CPUs on which this code
runs, for most (possibly all) message lengths. Benchmark results are
given in Tables 1 and 2 below.
- Use the same function prototypes and glue code as the new VAES-AVX10
algorithms. This fixes some issues with the integration of the
assembly and results in some significant performance improvements,
primarily on short messages. Also, the AVX and non-AVX
implementations are now registered as separate algorithms with the
crypto API, which makes them both testable by the self-tests.
- Keep support for AES-NI without AVX (for Westmere, Silvermont,
Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
Since 256-bit vectors cannot be used without VAES anyway, this is made
feasible by just using the non-VEX coded form of most instructions.
- Use a unified approach where the main loop does 8 blocks per iteration
and uses Karatsuba multiplication to save one pclmulqdq per block but
does not use the multiplication-less reduction. This strikes a good
balance across the range of CPUs on which this code runs.
- Don't spam the kernel log with an informational message on every boot.
The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell | 2% | 8% | 11% | 18% | 31% | 26% |
Intel Skylake | 1% | 4% | 7% | 12% | 26% | 19% |
Intel Cascade Lake | 3% | 8% | 10% | 18% | 33% | 24% |
AMD Zen 1 | 6% | 12% | 6% | 15% | 27% | 24% |
AMD Zen 2 | 8% | 13% | 13% | 19% | 26% | 28% |
AMD Zen 3 | 8% | 14% | 13% | 19% | 26% | 25% |
| 300 | 200 | 64 | 63 | 16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell | 35% | 29% | 45% | 55% | 54% |
Intel Skylake | 25% | 19% | 28% | 33% | 27% |
Intel Cascade Lake | 36% | 28% | 39% | 49% | 54% |
AMD Zen 1 | 27% | 22% | 23% | 29% | 26% |
AMD Zen 2 | 32% | 24% | 22% | 25% | 31% |
AMD Zen 3 | 30% | 24% | 22% | 23% | 26% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell | 3% | 8% | 11% | 19% | 32% | 28% |
Intel Skylake | 3% | 4% | 7% | 13% | 28% | 27% |
Intel Cascade Lake | 3% | 9% | 11% | 19% | 33% | 28% |
AMD Zen 1 | 15% | 18% | 14% | 20% | 36% | 33% |
AMD Zen 2 | 9% | 16% | 13% | 21% | 26% | 27% |
AMD Zen 3 | 8% | 15% | 12% | 18% | 23% | 23% |
| 300 | 200 | 64 | 63 | 16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell | 36% | 31% | 40% | 51% | 53% |
Intel Skylake | 28% | 21% | 23% | 30% | 30% |
Intel Cascade Lake | 36% | 29% | 36% | 47% | 53% |
AMD Zen 1 | 35% | 31% | 32% | 35% | 36% |
AMD Zen 2 | 31% | 30% | 27% | 38% | 30% |
AMD Zen 3 | 27% | 23% | 24% | 32% | 26% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector
AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or
AVX10. There are two implementations, sharing most source code: one
using 256-bit vectors and one using 512-bit vectors. This patch
improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
I wrote the new AES-GCM assembly code from scratch, focusing on
correctness, performance, code size (both source and binary), and
documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is
about 1200 lines including extensive comments, and it generates less
than 8 KB of binary code. The main loop does 4 vectors at a time, with
the AES and GHASH instructions interleaved. Any remainder is handled
using a simple 1 vector at a time loop, with masking.
Several VAES + AVX512 implementations of AES-GCM exist from Intel,
including one in OpenSSL and one proposed for inclusion in Linux in 2021
(https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/).
These aren't really suitable to be used, though, due to the massive
amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux)
and well as the significantly larger amount of assembly source (4978
lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not
support 256-bit vectors, which makes it not usable on future
AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have
downclocking issues. So I ended up starting from scratch. Usually my
much shorter code is actually slightly faster than Intel's AVX512 code,
though it depends on message length and on which of Intel's
implementations is used; for details, see Tables 3 and 4 below.
To facilitate potential integration into other projects, I've
dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause,
the same as the recently added RISC-V crypto code.
The following two tables summarize the performance improvement over the
existing AES-GCM code in Linux that uses AES-NI and AVX2:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 60% | 62% | 70% | 69% |
Intel Sapphire Rapids | 157% | 145% | 162% | 119% | 96% | 96% |
Intel Emerald Rapids | 156% | 144% | 161% | 115% | 95% | 100% |
AMD Zen 4 | 103% | 89% | 78% | 56% | 54% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 66% | 48% | 49% | 70% | 53% |
Intel Sapphire Rapids | 80% | 60% | 41% | 62% | 38% |
Intel Emerald Rapids | 79% | 60% | 41% | 62% | 38% |
AMD Zen 4 | 51% | 35% | 27% | 32% | 25% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
----------------------+-------+-------+-------+-------+-------+-------+
Intel Ice Lake | 42% | 48% | 59% | 63% | 67% | 71% |
Intel Sapphire Rapids | 159% | 145% | 161% | 125% | 102% | 100% |
Intel Emerald Rapids | 158% | 144% | 161% | 124% | 100% | 103% |
AMD Zen 4 | 110% | 95% | 80% | 59% | 56% | 54% |
| 300 | 200 | 64 | 63 | 16 |
----------------------+-------+-------+-------+-------+-------+
Intel Ice Lake | 67% | 56% | 46% | 70% | 56% |
Intel Sapphire Rapids | 79% | 62% | 39% | 61% | 39% |
Intel Emerald Rapids | 80% | 62% | 40% | 58% | 40% |
AMD Zen 4 | 49% | 36% | 30% | 35% | 28% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be
listed as 50%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
The following two tables summarize how the performance of my code
compares with Intel's AVX512 AES-GCM code, both the version that is in
OpenSSL and the version that was proposed for inclusion in Linux.
Neither version exists in Linux currently, but these are alternative
AES-GCM implementations that could be chosen instead of mine. I
collected the following numbers on Emerald Rapids using a userspace
benchmark program that calls the assembly functions directly.
I've also included a comparison with Cloudflare's AES-GCM implementation
from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
Table 3: VAES-based AES-256-GCM encryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14171 | 12956 | 12318 | 9588 | 7293 | 6449 |
AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 | 9107 | 5891 | 6472 |
AVX512_Intel_Linux | 13954 | 12277 | 11530 | 8712 | 6627 | 5898 |
AVX512_Cloudflare | 12564 | 11050 | 10905 | 8152 | 5345 | 5202 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 4939 | 3688 | 1846 | 1821 | 738 |
AVX512_Intel_OpenSSL | 4629 | 4532 | 2734 | 2332 | 1131 |
AVX512_Intel_Linux | 4035 | 2966 | 1567 | 1330 | 639 |
AVX512_Cloudflare | 3344 | 2485 | 1141 | 1127 | 456 |
Table 4: VAES-based AES-256-GCM decryption throughput in MB/s,
implementation name vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
---------------------+-------+-------+-------+-------+-------+-------+
This implementation | 14276 | 13311 | 13007 | 11086 | 8268 | 8086 |
AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 | 9587 | 5954 | 7060 |
AVX512_Intel_Linux | 14116 | 12795 | 11778 | 9269 | 7735 | 6455 |
AVX512_Cloudflare | 13301 | 12018 | 11919 | 9182 | 7189 | 6726 |
| 300 | 200 | 64 | 63 | 16 |
---------------------+-------+-------+-------+-------+-------+
This implementation | 6454 | 5020 | 2635 | 2602 | 1079 |
AVX512_Intel_OpenSSL | 5184 | 5799 | 2957 | 2545 | 1228 |
AVX512_Intel_Linux | 4394 | 4247 | 2235 | 1635 | 922 |
AVX512_Cloudflare | 4289 | 3851 | 1435 | 1417 | 574 |
So, usually my code is actually slightly faster than Intel's code,
though the OpenSSL implementation has a slight edge on messages shorter
than 256 bytes in this microbenchmark. (This also holds true when doing
the same tests on AMD Zen 4.) It can be seen that the large code size
(up to 94x larger!) of the Intel implementations doesn't seem to bring
much benefit, so starting from scratch with much smaller code, as I've
done, seems appropriate. The performance of my code on messages shorter
than 256 bytes could be improved through a limited amount of unrolling,
but it's unclear it would be worth it, given code size considerations
(e.g. caches) that don't get measured in microbenchmarks.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
Add an assembly file aes-xts-avx-x86_64.S which contains a macro that
expands into AES-XTS implementations for x86_64 CPUs that support at
least AES-NI and AVX, optionally also taking advantage of VAES,
VPCLMULQDQ, and AVX512 or AVX10.
This patch doesn't expand the macro at all. Later patches will do so,
adding each implementation individually so that the motivation and use
case for each individual implementation can be fully presented.
The file also provides a function aes_xts_encrypt_iv() which handles the
encryption of the IV (tweak), using AES-NI and AVX.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
aria-avx512 implementation uses AVX512 and GFNI.
It supports 64way parallel processing.
So, byteslicing code is changed to support 64way parallel.
And it exports some aria-avx2 functions such as encrypt() and decrypt().
AVX and AVX2 have 16 registers.
They should use memory to store/load state because of lack of registers.
But AVX512 supports 32 registers.
So, it doesn't require store/load in the s-box layer.
It means that it can reduce overhead of store/load in the s-box layer.
Also code become much simpler.
Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100:
ARIA-AVX512(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption
tcrypt: 1 operation in 1504 cycles (1024 bytes)
tcrypt: 1 operation in 4595 cycles (4096 bytes)
tcrypt: 1 operation in 1763 cycles (1024 bytes)
tcrypt: 1 operation in 5540 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption
tcrypt: 1 operation in 1502 cycles (1024 bytes)
tcrypt: 1 operation in 4615 cycles (4096 bytes)
tcrypt: 1 operation in 1759 cycles (1024 bytes)
tcrypt: 1 operation in 5554 cycles (4096 bytes)
ARIA-AVX2 with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption
tcrypt: 1 operation in 2003 cycles (1024 bytes)
tcrypt: 1 operation in 5867 cycles (4096 bytes)
tcrypt: 1 operation in 2358 cycles (1024 bytes)
tcrypt: 1 operation in 7295 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption
tcrypt: 1 operation in 2004 cycles (1024 bytes)
tcrypt: 1 operation in 5956 cycles (4096 bytes)
tcrypt: 1 operation in 2409 cycles (1024 bytes)
tcrypt: 1 operation in 7564 cycles (4096 bytes)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
aria-avx2 implementation uses AVX2, AES-NI, and GFNI.
It supports 32way parallel processing.
So, byteslicing code is changed to support 32way parallel.
And it exports some aria-avx functions such as encrypt() and decrypt().
There are two main logics, s-box layer and diffusion layer.
These codes are the same as aria-avx implementation.
But some instruction are exchanged because they don't support 256bit
registers.
Also, AES-NI doesn't support 256bit register.
So, aesenclast and aesdeclast are used twice like below:
vextracti128 $1, ymm0, xmm6;
vaesenclast xmm7, xmm0, xmm0;
vaesenclast xmm7, xmm6, xmm6;
vinserti128 $1, xmm6, ymm0, ymm0;
Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100:
ARIA-AVX2 with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption
tcrypt: 1 operation in 2003 cycles (1024 bytes)
tcrypt: 1 operation in 5867 cycles (4096 bytes)
tcrypt: 1 operation in 2358 cycles (1024 bytes)
tcrypt: 1 operation in 7295 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption
tcrypt: 1 operation in 2004 cycles (1024 bytes)
tcrypt: 1 operation in 5956 cycles (4096 bytes)
tcrypt: 1 operation in 2409 cycles (1024 bytes)
tcrypt: 1 operation in 7564 cycles (4096 bytes)
ARIA-AVX with GFNI(128bit and 256bit)
testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption
tcrypt: 1 operation in 2761 cycles (1024 bytes)
tcrypt: 1 operation in 9390 cycles (4096 bytes)
tcrypt: 1 operation in 3401 cycles (1024 bytes)
tcrypt: 1 operation in 11876 cycles (4096 bytes)
testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption
tcrypt: 1 operation in 2735 cycles (1024 bytes)
tcrypt: 1 operation in 9424 cycles (4096 bytes)
tcrypt: 1 operation in 3369 cycles (1024 bytes)
tcrypt: 1 operation in 11954 cycles (4096 bytes)
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
curve25519-x86_64.c fails to build when CONFIG_GCOV_KERNEL is enabled.
The error is "inline assembly requires more registers than available"
thrown from the `fsqr()` function. Therefore, excluding this file from
GCOV profiling until this issue is resolved. Thereby allowing
CONFIG_GCOV_PROFILE_ALL to be enabled for x86.
Signed-off-by: Joe Fradley <joefradley@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
aria cipher
The implementation is based on the 32-bit implementation of the aria.
Also, aria-avx process steps are the similar to the camellia-avx.
1. Byteslice(16way)
2. Add-round-key.
3. Sbox
4. Diffusion layer.
Except for s-box, all steps are the same as the aria-generic
implementation. s-box step is very similar to camellia and
sm4 implementation.
There are 2 implementations for s-box step.
One is to use AES-NI and affine transformation, which is the same as
Camellia, sm4, and others.
Another is to use GFNI.
GFNI implementation is faster than AES-NI implementation.
So, it uses GFNI implementation if the running CPU supports GFNI.
There are 4 s-boxes in the ARIA and the 2 s-boxes are the same as
AES's s-boxes.
To calculate the first sbox, it just uses the aesenclast and then
inverts shift_row.
No more process is needed for this job because the first s-box is
the same as the AES encryption s-box.
To calculate the second sbox(invert of s1), it just uses the aesdeclast
and then inverts shift_row.
No more process is needed for this job because the second s-box is
the same as the AES decryption s-box.
To calculate the third s-box, it uses the aesenclast,
then affine transformation, which is combined AES inverse affine and
ARIA S2.
To calculate the last s-box, it uses the aesdeclast,
then affine transformation, which is combined X2 and AES forward affine.
The optimized third and last s-box logic and GFNI s-box logic are
implemented by Jussi Kivilinna.
The aria-generic implementation is based on a 32-bit implementation,
not an 8-bit implementation. the aria-avx Diffusion Layer implementation
is based on aria-generic implementation because 8-bit implementation is
not fit for parallel implementation but 32-bit is enough to fit for this.
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|