JDK-8349106

Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64

    • b09
    • aarch64
    • generic

      On aarch64, the original implementation of the ChaCha20 block function used a block-parallel approach: each 32-bit state word was duplicated onto all lanes of a SIMD register, with one register per state element. The x86_64 implementation instead follows a quarter-round parallel approach, in which the 512-bit state is held as four 128-bit segments on 4 contiguous SIMD registers.
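To make the quarter-round parallel layout concrete, the following is a minimal scalar sketch in Java, not the intrinsic itself. It assumes the standard ChaCha20 quarter round from RFC 8439 and models each 128-bit SIMD register as a hypothetical 4-element `int[]`; the class and method names (`ChaChaLayouts`, `rowQuarterRound`, `rotateLanes`) are illustrative only. One quarter round applied lane-wise to the four rows performs all four column quarter rounds at once, and rotating the lanes of rows 1 to 3 lines up the diagonals for the second half of the double round.

```java
import java.util.Arrays;

// Illustrative sketch only: plain Java arrays stand in for SIMD registers.
class ChaChaLayouts {
    // Scalar ChaCha20 quarter round on state words at indices a, b, c, d.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Reference double round: four column rounds, then four diagonal rounds.
    static void doubleRoundScalar(int[] s) {
        quarterRound(s, 0, 4, 8, 12);
        quarterRound(s, 1, 5, 9, 13);
        quarterRound(s, 2, 6, 10, 14);
        quarterRound(s, 3, 7, 11, 15);
        quarterRound(s, 0, 5, 10, 15);
        quarterRound(s, 1, 6, 11, 12);
        quarterRound(s, 2, 7, 8, 13);
        quarterRound(s, 3, 4, 9, 14);
    }

    // One quarter round applied to whole rows; each loop iteration is one lane.
    static void rowQuarterRound(int[] a, int[] b, int[] c, int[] d) {
        for (int i = 0; i < 4; i++) {
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 16);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 12);
            a[i] += b[i]; d[i] = Integer.rotateLeft(d[i] ^ a[i], 8);
            c[i] += d[i]; b[i] = Integer.rotateLeft(b[i] ^ c[i], 7);
        }
    }

    // Rotate the four lanes left by n positions: result[i] = v[(i + n) % 4].
    static int[] rotateLanes(int[] v, int n) {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++) {
            r[i] = v[(i + n) % 4];
        }
        return r;
    }

    // Double round with the 16-word state held as four 4-lane rows.
    static void doubleRoundRowParallel(int[] s) {
        int[] a = Arrays.copyOfRange(s, 0, 4);
        int[] b = Arrays.copyOfRange(s, 4, 8);
        int[] c = Arrays.copyOfRange(s, 8, 12);
        int[] d = Arrays.copyOfRange(s, 12, 16);

        rowQuarterRound(a, b, c, d);          // all four columns at once

        b = rotateLanes(b, 1);                // line up the diagonals
        c = rotateLanes(c, 2);
        d = rotateLanes(d, 3);
        rowQuarterRound(a, b, c, d);          // all four diagonals at once

        b = rotateLanes(b, 3);                // restore the row layout
        c = rotateLanes(c, 2);
        d = rotateLanes(d, 1);

        System.arraycopy(a, 0, s, 0, 4);
        System.arraycopy(b, 0, s, 4, 4);
        System.arraycopy(c, 0, s, 8, 4);
        System.arraycopy(d, 0, s, 12, 4);
    }

    public static void main(String[] args) {
        int[] state = new int[16];
        for (int i = 0; i < 16; i++) {
            state[i] = 0x01020304 * (i + 1);  // arbitrary test pattern
        }
        int[] x = state.clone();
        int[] y = state.clone();
        doubleRoundScalar(x);
        doubleRoundRowParallel(y);
        System.out.println("layouts agree: " + Arrays.equals(x, y));
    }
}
```

In a real SIMD implementation each `int[]` row would be one vector register and the per-lane loop would be a handful of vector instructions. The block-parallel layout avoids the lane rotations but needs 16 registers for the state, one per state element.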

      Profiling only the keystream generation function in assembly on aarch64 shows roughly an 11% speedup for the quarter-round parallel version over the block-parallel version. When placed into an intrinsic and used for a complete ChaCha20 encryption or decryption operation, the gain is a more modest 2-4%, depending on the input size.

      The plan is to move to the quarter-round parallel implementation in order to take advantage of this speed increase.

            jnimeh Jamil Nimeh