JDK-8349106

Change ChaCha20 intrinsic to use quarter-round parallel implementation on aarch64


    • b09
    • aarch64
    • generic

        On aarch64, the original implementation of the ChaCha20 block function used a block-parallel approach: each 32-bit state word was duplicated across all lanes of a SIMD register, with one register per state element. The x86_64 implementation instead followed a quarter-round parallel approach: the 512-bit state is held in 4 contiguous SIMD registers, one 128-bit segment per register.
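        The quarter-round parallel layout described above can be sketched in plain Java. This is only an illustration and not the HotSpot intrinsic: the class and method names are hypothetical, and each `int[4]` array stands in for one 128-bit SIMD register. The sketch assumes the standard ChaCha20 double round from RFC 8439 and checks the row-oriented version against a scalar reference.

```java
import java.util.Arrays;

// Hypothetical illustration only; int[4] arrays stand in for 128-bit SIMD registers.
final class ChaCha20LayoutSketch {

    // Scalar reference: one quarter round on state words at indices a, b, c, d.
    static void quarterRound(int[] s, int a, int b, int c, int d) {
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 16);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 12);
        s[a] += s[b]; s[d] = Integer.rotateLeft(s[d] ^ s[a], 8);
        s[c] += s[d]; s[b] = Integer.rotateLeft(s[b] ^ s[c], 7);
    }

    // Scalar reference block function: 10 double rounds plus the final add.
    static int[] blockScalar(int[] init) {
        int[] s = init.clone();
        for (int i = 0; i < 10; i++) {
            quarterRound(s, 0, 4, 8, 12);  quarterRound(s, 1, 5, 9, 13);
            quarterRound(s, 2, 6, 10, 14); quarterRound(s, 3, 7, 11, 15);
            quarterRound(s, 0, 5, 10, 15); quarterRound(s, 1, 6, 11, 12);
            quarterRound(s, 2, 7, 8, 13);  quarterRound(s, 3, 4, 9, 14);
        }
        for (int i = 0; i < 16; i++) s[i] += init[i];
        return s;
    }

    // Lane-wise x += y; z = rotl(z ^ x, n). Models a SIMD add, xor and rotate.
    static void step(int[] x, int[] y, int[] z, int n) {
        for (int i = 0; i < 4; i++) {
            x[i] += y[i];
            z[i] = Integer.rotateLeft(z[i] ^ x[i], n);
        }
    }

    // Four quarter rounds at once, one per lane.
    static void rowRound(int[] a, int[] b, int[] c, int[] d) {
        step(a, b, d, 16);
        step(c, d, b, 12);
        step(a, b, d, 8);
        step(c, d, b, 7);
    }

    // Rotate the lanes of a row left by n positions: out[i] = r[(i + n) % 4].
    static int[] rotateLanes(int[] r, int n) {
        int[] out = new int[4];
        for (int i = 0; i < 4; i++) out[i] = r[(i + n) % 4];
        return out;
    }

    // Quarter-round parallel block function: the state lives in four rows.
    static int[] blockRowParallel(int[] init) {
        int[] a = Arrays.copyOfRange(init, 0, 4);
        int[] b = Arrays.copyOfRange(init, 4, 8);
        int[] c = Arrays.copyOfRange(init, 8, 12);
        int[] d = Arrays.copyOfRange(init, 12, 16);
        for (int i = 0; i < 10; i++) {
            rowRound(a, b, c, d);                 // column round
            b = rotateLanes(b, 1);                // line the diagonals up as columns
            c = rotateLanes(c, 2);
            d = rotateLanes(d, 3);
            rowRound(a, b, c, d);                 // diagonal round
            b = rotateLanes(b, 3);                // restore the original lane order
            c = rotateLanes(c, 2);
            d = rotateLanes(d, 1);
        }
        int[] out = new int[16];
        for (int i = 0; i < 4; i++) {
            out[i]      = a[i] + init[i];
            out[4 + i]  = b[i] + init[4 + i];
            out[8 + i]  = c[i] + init[8 + i];
            out[12 + i] = d[i] + init[12 + i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Arbitrary example state: constants, key words, counter, nonce words.
        int[] init = {
            0x61707865, 0x3320646e, 0x79622d32, 0x6b206574,
            0x03020100, 0x07060504, 0x0b0a0908, 0x0f0e0d0c,
            0x13121110, 0x17161514, 0x1b1a1918, 0x1f1e1d1c,
            0x00000001, 0x09000000, 0x4a000000, 0x00000000
        };
        boolean same = Arrays.equals(blockScalar(init), blockRowParallel(init));
        System.out.println("row-parallel matches scalar: " + same);
    }
}
```

        In this layout every lane-wise operation advances four quarter rounds of a single block, and only the lane rotations between the column and diagonal rounds move data across lanes. A block-parallel layout, by contrast, would keep 16 such rows (one per state word) and compute several blocks at once.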

        Profiling only the keystream generation function in assembly on aarch64 shows roughly an 11% speedup for the quarter-round parallel version over the block-parallel one. When the code is placed into an intrinsic and used for a complete ChaCha20 encryption or decryption operation, the gain is a more modest 2-4%, depending on the input size.
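        A rough way to observe the end-to-end effect for different input sizes is a small timing loop over the standard `javax.crypto` ChaCha20 cipher. The sketch below is an assumption about how one might measure this, not the harness used for the numbers above; the class name, helper names, sizes, and iteration counts are hypothetical, and a proper benchmark framework such as JMH would give more reliable results.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.ChaCha20ParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical timing sketch; not the harness behind the figures in this issue.
final class ChaCha20TimingSketch {

    // Run one complete ChaCha20 operation with a fresh Cipher instance.
    static byte[] chacha20(int mode, byte[] key, byte[] nonce, int counter, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance("ChaCha20");
            cipher.init(mode, new SecretKeySpec(key, "ChaCha20"),
                        new ChaCha20ParameterSpec(nonce, counter));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Average throughput in MB/s for encrypting `size` bytes `iterations` times.
    static double megabytesPerSecond(int size, int iterations) {
        byte[] key = new byte[32];    // all-zero key: for timing only, never for real use
        byte[] nonce = new byte[12];  // all-zero nonce: for timing only
        byte[] input = new byte[size];
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            chacha20(Cipher.ENCRYPT_MODE, key, nonce, 0, input);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (double) size * iterations / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) {
        int[] sizes = {256, 1024, 16384};            // example input sizes in bytes
        for (int size : sizes) {
            megabytesPerSecond(size, 20_000);        // warm-up so the JIT can compile the path
            double rate = megabytesPerSecond(size, 20_000);
            System.out.printf("%6d bytes: %.1f MB/s%n", size, rate);
        }
    }
}
```

        Running the same loop on builds with and without the change would give a comparison per input size. Cipher setup cost is included in each iteration, so small inputs are expected to show a smaller share of time in the block function itself.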

        The plan is to move to the quarter-round parallel implementation in order to take advantage of this speed increase.

              jnimeh Jamil Nimeh
