-
Enhancement
-
Resolution: Unresolved
-
P4
-
20
-
None
-
x86_64
The AVX-512 intrinsics for ChaCha20 maximize keystream output by using multiple sets of four 512-bit registers, which yields 1024 bytes of keystream per call into the intrinsic-enabled method. This gives significant performance gains over the AVX2 and pure Java implementations at larger sizes, but it can be slower when the job size is small.
For single-part encryption or decryption operations where the job size is 512 bytes or less, an implementation using AVX2 opcodes outperforms the AVX-512 version. This appears to be due in part to: 1) Fewer register sets being used to generate keystream, and therefore less time spent on reads/writes between registers and memory. 2) Possibly lower-latency or faster instructions being employed.
An implementation using AVX2 yields 256 bytes of keystream per call. When a job requires 512 bytes or less, the time needed for two AVX2 keystream calls is still less than the time needed for a single AVX-512 call that uses all 4 register sets and produces 1024 bytes. Once a third AVX2 call is needed, the single AVX-512 call is faster.
Additional benchmarking is needed to evaluate whether AVX2 with 2 register sets is faster than AVX-512 with a single register set (both configurations produce 256 bytes of keystream per call).
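The call-count arithmetic above can be sketched in Java as follows. This is only an illustration of the reasoning, not the intrinsic or stub code itself: the class name, method names, and constants are hypothetical, and the 512-byte cutoff is taken from the observation in this issue rather than from any finalized benchmark.

```java
public class ChaCha20PathSketch {
    // Keystream bytes produced per call, as stated in this issue.
    static final int AVX2_BYTES_PER_CALL = 256;
    static final int AVX512_BYTES_PER_CALL = 1024;

    // Assumed cutoff: jobs at or below this size favor AVX2 (hypothetical constant).
    static final int AVX2_PREFERRED_MAX_BYTES = 512;

    // Number of keystream calls needed to cover jobBytes (ceiling division).
    static int callsNeeded(int jobBytes, int bytesPerCall) {
        return (jobBytes + bytesPerCall - 1) / bytesPerCall;
    }

    // Picks the path this issue expects to be faster for a single-part job.
    static String preferredPath(int jobBytes) {
        return jobBytes <= AVX2_PREFERRED_MAX_BYTES ? "AVX2" : "AVX-512";
    }

    public static void main(String[] args) {
        for (int size : new int[] {64, 256, 512, 513, 1024}) {
            System.out.printf("%5d bytes: AVX2 calls=%d, AVX-512 calls=%d, preferred=%s%n",
                    size,
                    callsNeeded(size, AVX2_BYTES_PER_CALL),
                    callsNeeded(size, AVX512_BYTES_PER_CALL),
                    preferredPath(size));
        }
    }
}
```

A 512-byte job needs two AVX2 calls and a 513-byte job needs three, which is the point at which the single AVX-512 call is reported to win.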
- relates to
-
JDK-8247645 ChaCha20 Intrinsics
-
- Resolved
-