JDK-8302007

Improve ChaCha20 intrinsics for single-part encryption/decryption on AVX-512



    • Type: Bug
    • Resolution: Unresolved
    • Priority: P4
    • Fix Version/s: tbd
    • Affects Version/s: 20
    • Component/s: security-libs
    • Labels: None


      The AVX-512 intrinsics for ChaCha20 maximize keystream output by using multiple sets of 4 registers at 512-bit width. This yields 1024 bytes of output per call into the intrinsic-enabled method. While this gives significant performance gains over the AVX2 and pure Java implementations at larger sizes, it can be slower than either when the job size is small.

      For single-part encryption or decryption operations where the job size is <= 512 bytes, an implementation using AVX2 opcodes outperforms the AVX-512 version. This appears to be due in part to: 1) fewer register sets being used to generate keystream, and therefore faster reads/writes between registers and memory; 2) possibly lower-latency/faster instructions being employed.

      An implementation using AVX2 yields 256 bytes of keystream per call. When a job requires 512 bytes or fewer, the time needed for two AVX2 keystream calls is still less than the time needed for a single full AVX-512 call, which always generates 1024 bytes. Once a third AVX2 call is needed (more than 512 bytes), the single AVX-512 call is faster.
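
      The break-even reasoning above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not HotSpot code: the class name, method names, and the idea of returning a path label are all hypothetical, and the only inputs taken from this issue are the 256-byte and 1024-byte per-call outputs and the observed two-call crossover.

```java
// Hypothetical sketch: illustrates the call-count arithmetic from the description.
// None of these names exist in the JDK or in HotSpot.
final class KeystreamPathSketch {
    // Per-call keystream output, as stated in the issue description.
    static final int AVX2_BYTES_PER_CALL = 256;
    static final int AVX512_BYTES_PER_CALL = 1024;

    // Assumption from the description: up to two AVX2 calls beat one AVX-512 call.
    static final int MAX_AVX2_CALLS_BEFORE_CROSSOVER = 2;

    // Number of intrinsic calls needed to cover jobBytes (ceiling division).
    static int callsNeeded(int jobBytes, int bytesPerCall) {
        if (jobBytes < 0 || bytesPerCall <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (int) (((long) jobBytes + bytesPerCall - 1) / bytesPerCall);
    }

    // Label of the path that the description suggests would be faster.
    static String preferredPath(int jobBytes) {
        int avx2Calls = callsNeeded(jobBytes, AVX2_BYTES_PER_CALL);
        return avx2Calls <= MAX_AVX2_CALLS_BEFORE_CROSSOVER ? "AVX2" : "AVX-512";
    }

    public static void main(String[] args) {
        for (int size : new int[] {64, 256, 512, 513, 1024}) {
            System.out.printf("%5d bytes: %d AVX2 call(s), %d AVX-512 call(s), prefer %s%n",
                    size,
                    callsNeeded(size, AVX2_BYTES_PER_CALL),
                    callsNeeded(size, AVX512_BYTES_PER_CALL),
                    preferredPath(size));
        }
    }
}
```

      The crossover constant is kept separate from the per-call sizes because it is an empirical observation; further benchmarking could move it.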

      Additional benchmarking is required to evaluate whether AVX2 with 2 register sets is faster than AVX-512 with a single register set.
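
      As a starting point for such measurements, a rough timing loop over small single-part operations might look like the sketch below. It is not a substitute for a proper JMH benchmark: the class and method names are hypothetical, the all-zero key is for illustration only, and the warm-up and iteration counts are arbitrary. It uses the standard javax.crypto ChaCha20 cipher with a fresh nonce per operation, since the cipher must be re-initialized after each doFinal. Comparing instruction sets would additionally assume that running the JVM with different -XX:UseAVX levels selects different intrinsic variants.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.ChaCha20ParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: crude timing of single-part ChaCha20 operations.
final class SmallJobTimingSketch {
    private static int nonceCounter = 0;

    // Builds a 12-byte nonce that differs on every call.
    private static byte[] nextNonce() {
        return ByteBuffer.allocate(12).putInt(8, nonceCounter++).array();
    }

    // Performs one single-part encryption of input and returns the ciphertext.
    static byte[] encryptOnce(Cipher cipher, SecretKeySpec key, byte[] input) {
        try {
            cipher.init(Cipher.ENCRYPT_MODE, key, new ChaCha20ParameterSpec(nextNonce(), 0));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Average wall-clock nanoseconds per single-part operation of jobBytes.
    static double nanosPerOp(int jobBytes, int iterations) {
        try {
            Cipher cipher = Cipher.getInstance("ChaCha20");
            // All-zero key: acceptable for timing only, never for real data.
            SecretKeySpec key = new SecretKeySpec(new byte[32], "ChaCha20");
            byte[] input = new byte[jobBytes];
            long sink = 0; // consume results so the loop is not optimized away
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                sink += encryptOnce(cipher, key, input).length;
            }
            long elapsed = System.nanoTime() - start;
            if (sink != (long) jobBytes * iterations) {
                throw new IllegalStateException("unexpected output length");
            }
            return (double) elapsed / iterations;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = {64, 256, 512, 768, 1024};
        for (int size : sizes) {
            nanosPerOp(size, 50_000); // warm-up so the JIT can compile the hot path
        }
        for (int size : sizes) {
            System.out.printf("%5d bytes: %.1f ns/op%n", size, nanosPerOp(size, 200_000));
        }
    }
}
```

      The sizes are chosen to straddle the 512-byte crossover described above; the numbers it prints depend on the hardware and JVM flags and should be treated as indicative only.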


              jnimeh Jamil Nimeh