[VectorAPI]: AArch64: Avoid using zeroing mode SVE CPY instruction


    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P4
    • Fix Version/s: tbd
    • Affects Version/s: 27
    • Component/s: hotspot
    • CPU: aarch64
    • OS: generic

      Problem:
      --------
      On certain AArch64 microarchitectures, the SVE `cpy` instruction with
      zeroing predication may carry a false dependency on its destination
      register. When adjacent loop iterations reuse the same destination
      register, this limits instruction-level parallelism across iterations
      and manifests as a slowdown of **12%** to **100%** (up to **2x**) in
      specific Java Vector API microbenchmarks.

      Specifically, consider the following pattern:
      ```
      iter_n:
        cpy z17.d, p1/z, #1 // Predicated zeroing move

      iter_n+1:
        cpy z17.d, p1/z, #1 // Same destination register
      ```

      The microarchitecture's register renaming logic may conservatively
      treat the destination register as having a read-after-write dependency
      on the previous value, even though the zeroing predication semantically
      does not require reading the old value. This behavior has been observed
      and reproduced on Neoverse-V1 and Neoverse-V2 cores using both Java
      microbenchmarks and standalone C reproducers.
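
      To illustrate why the zeroing form has no architectural need for the
      old destination value, here is a minimal scalar sketch in Java of the
      per-lane semantics of the zeroing (`/z`) and merging (`/m`) forms of
      `cpy`. The class and method names (`CpySemantics`, `cpyZeroing`,
      `cpyMerging`) are hypothetical and chosen for illustration only. The
      sketch models the architectural result, not the microarchitectural
      behavior described above.

      ```java
      import java.util.Arrays;

      public class CpySemantics {
          // Zeroing form (p/z): active lanes get imm, inactive lanes get 0.
          // The result depends only on the predicate and the immediate, so
          // the old destination value is never an input.
          static long[] cpyZeroing(boolean[] pred, long imm) {
              long[] dst = new long[pred.length];
              for (int i = 0; i < pred.length; i++) {
                  dst[i] = pred[i] ? imm : 0L;
              }
              return dst;
          }

          // Merging form (p/m): active lanes get imm, inactive lanes keep
          // the old destination value, so the old value is a real input.
          static long[] cpyMerging(long[] oldDst, boolean[] pred, long imm) {
              long[] dst = oldDst.clone();
              for (int i = 0; i < pred.length; i++) {
                  if (pred[i]) {
                      dst[i] = imm;
                  }
              }
              return dst;
          }

          public static void main(String[] args) {
              boolean[] p1 = {true, false, true, false};
              long[] old = {7, 7, 7, 7};
              System.out.println(Arrays.toString(cpyZeroing(p1, 1)));      // [1, 0, 1, 0]
              System.out.println(Arrays.toString(cpyMerging(old, p1, 1))); // [1, 7, 1, 7]
          }
      }
      ```

      Only `cpyMerging` takes the old destination as a parameter. A core
      that orders the zeroing form behind the previous write to the same
      register is therefore enforcing a dependency the semantics do not
      require.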

      Impact:
      --------
      Currently the instruction is used in code generated for
      `VectorStoreMaskNode` and `VectorReinterpretNode`, affecting all Vector
      API methods that generate these two IR nodes, such as
      `VectorMask.intoArray()` and `Vector.toLong()`. Microbenchmark
      measurements show performance degradation ranging from **12%** to
      **2x**, depending on the specific operation and data types involved.

            Assignee:
            Eric Fang
            Reporter:
            Eric Fang
            Votes:
            0
            Watchers:
            1

              Created:
              Updated: