-
Enhancement
-
Resolution: Fixed
-
P4
-
None
-
b09
-
aarch64
-
generic
On aarch64, the original implementation of the ChaCha20 block function used a block-parallel approach (a single 32-bit state integer was duplicated onto all lanes of a SIMD register, one register per state element), while the x86_64 implementation followed a quarter-round parallel approach (the 512-bit state is split into four 128-bit segments, each held in one of 4 contiguous SIMD registers).
Profiling only the keystream generation function in assembly on aarch64 shows roughly an 11% speed gain for the quarter-round parallel version over the block-parallel one. When placed into an intrinsic and used for a complete ChaCha20 encryption or decryption operation, the gain is a more modest 2-4%, depending on the input size.
The plan is to move to the quarter-round parallel implementation in order to take advantage of this speed increase.
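For readers unfamiliar with the quarter-round parallel layout, here is a minimal Java sketch of the idea. It is not the HotSpot intrinsic: each `int[4]` is a hypothetical stand-in for one 128-bit SIMD register holding a row of the state, and the class and method names (`ChaChaLayoutSketch`, `blockRowParallel`, `rotateLanes`) are invented for illustration. One lane-wise pass over the four rows performs all four column quarter rounds at once; rotating the lanes of rows b, c and d lines the diagonals up as columns so the same pass handles the diagonal round. A plain scalar version is included only as a cross-check.

```java
import java.util.Arrays;

class ChaChaLayoutSketch {
    // One quarter-round step applied lane-wise to four rows. Lanes are
    // independent, so this loop models a sequence of 4-lane SIMD operations.
    static void quarterRoundRows(int[] a, int[] b, int[] c, int[] d) {
        for (int i = 0; i < 4; i++) {
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 16);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 12);
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 8);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 7);
        }
    }

    // Rotate the lanes of a row left by n positions: lane i takes lane (i + n) mod 4.
    static void rotateLanes(int[] row, int n) {
        int[] t = row.clone();
        for (int i = 0; i < 4; i++) {
            row[i] = t[(i + n) & 3];
        }
    }

    // ChaCha20 block function over a 16-word state, using the row layout.
    static int[] blockRowParallel(int[] state) {
        int[] a = Arrays.copyOfRange(state, 0, 4);
        int[] b = Arrays.copyOfRange(state, 4, 8);
        int[] c = Arrays.copyOfRange(state, 8, 12);
        int[] d = Arrays.copyOfRange(state, 12, 16);
        for (int r = 0; r < 10; r++) {
            quarterRoundRows(a, b, c, d);                              // column round
            rotateLanes(b, 1); rotateLanes(c, 2); rotateLanes(d, 3);  // diagonals -> columns
            quarterRoundRows(a, b, c, d);                              // diagonal round
            rotateLanes(b, 3); rotateLanes(c, 2); rotateLanes(d, 1);  // restore lane order
        }
        int[] out = new int[16];
        for (int i = 0; i < 4; i++) {
            // Final feed-forward: add the input state to the worked state.
            out[i] = a[i] + state[i];
            out[4 + i] = b[i] + state[4 + i];
            out[8 + i] = c[i] + state[8 + i];
            out[12 + i] = d[i] + state[12 + i];
        }
        return out;
    }

    // Scalar quarter round on state indices, used by the reference version.
    static void quarterRound(int[] x, int a, int b, int c, int d) {
        x[a] += x[b]; x[d] = Integer.rotateLeft(x[d] ^ x[a], 16);
        x[c] += x[d]; x[b] = Integer.rotateLeft(x[b] ^ x[c], 12);
        x[a] += x[b]; x[d] = Integer.rotateLeft(x[d] ^ x[a], 8);
        x[c] += x[d]; x[b] = Integer.rotateLeft(x[b] ^ x[c], 7);
    }

    // Straightforward scalar block function for cross-checking.
    static int[] blockScalar(int[] state) {
        int[] x = state.clone();
        for (int r = 0; r < 10; r++) {
            quarterRound(x, 0, 4, 8, 12);
            quarterRound(x, 1, 5, 9, 13);
            quarterRound(x, 2, 6, 10, 14);
            quarterRound(x, 3, 7, 11, 15);
            quarterRound(x, 0, 5, 10, 15);
            quarterRound(x, 1, 6, 11, 12);
            quarterRound(x, 2, 7, 8, 13);
            quarterRound(x, 3, 4, 9, 14);
        }
        for (int i = 0; i < 16; i++) {
            x[i] += state[i];
        }
        return x;
    }
}
```

In this layout one block occupies only four registers, so an implementation can keep several independent blocks in flight at once; the block-parallel layout instead dedicates one register to each of the 16 state elements and needs no lane rotations. The sketch says nothing about which layout is faster on a given CPU.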
- causes
-
JDK-8350126 Regression ~3% on Crypto-ChaCha20Poly1305.encrypt for MacOSX aarch64
-
- In Progress
-
- links to
-
Commit(master) openjdk/jdk/ee4caa41
-
Review(master) openjdk/jdk/23397