- Type: Enhancement
- Resolution: Unresolved
- Priority: P4
- Affects Version/s: 27
- Component/s: hotspot
- aarch64
- generic

Problem:
--------
On certain AArch64 microarchitectures, the SVE `cpy` instruction with
zeroing predication may exhibit a dependency-tracking behavior that
prevents optimal instruction-level parallelism between adjacent loop
iterations when the same destination register is reused. This manifests
as a performance regression of **12%** to **100%** in specific Java
Vector API microbenchmarks.
Specifically, the following pattern triggers the behavior:
```
iter_n:
cpy z17.d, p1/z, #1 // Predicated zeroing move
iter_n+1:
cpy z17.d, p1/z, #1 // Same destination register
```
The microarchitecture's register renaming logic may conservatively
treat the destination register as having a read-after-write dependency
on the previous value, even though the zeroing predication semantically
does not require reading the old value. This behavior has been observed
and reproduced on Neoverse-V1 and Neoverse-V2 cores using both Java
microbenchmarks and standalone C reproducers.
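The standalone C reproducer is not included in this report. As a rough illustration of what such a reproducer might look like, the sketch below issues the `cpy` pattern above from inline assembly, once with a single destination register and once rotating across four registers. The function names (`sve_build`, `run_cpy_blocks`, `time_cpy_blocks`), the unroll factor of four, and the choice of `z18`-`z20` as alternate destinations are all hypothetical and not taken from the original reproducer. On builds without SVE enabled the loops compile to no-ops.

```c
#include <stdint.h>
#include <time.h>

/* Returns 1 when compiled with SVE enabled (e.g. -march=armv8-a+sve
 * on AArch64), 0 otherwise. */
int sve_build(void)
{
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: issues `blocks` groups of four predicated zeroing
 * moves. same_dest != 0 writes z17 every time (the pattern in this report);
 * same_dest == 0 rotates over z17..z20 as a comparison point.
 * Returns the number of cpy instructions issued (0 on non-SVE builds). */
uint64_t run_cpy_blocks(uint64_t blocks, int same_dest)
{
    uint64_t issued = 0;
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE)
    for (uint64_t i = 0; i < blocks; i++) {
        if (same_dest) {
            __asm__ volatile(
                "ptrue p1.d\n\t"              /* all-true predicate */
                "cpy z17.d, p1/z, #1\n\t"
                "cpy z17.d, p1/z, #1\n\t"
                "cpy z17.d, p1/z, #1\n\t"
                "cpy z17.d, p1/z, #1\n\t"
                ::: "p1", "z17");
        } else {
            __asm__ volatile(
                "ptrue p1.d\n\t"
                "cpy z17.d, p1/z, #1\n\t"
                "cpy z18.d, p1/z, #1\n\t"
                "cpy z19.d, p1/z, #1\n\t"
                "cpy z20.d, p1/z, #1\n\t"
                ::: "p1", "z17", "z18", "z19", "z20");
        }
        issued += 4;
    }
#else
    (void)blocks;      /* no SVE: nothing to issue */
    (void)same_dest;
#endif
    return issued;
}

/* Returns the CPU time in seconds spent in run_cpy_blocks(). */
double time_cpy_blocks(uint64_t blocks, int same_dest)
{
    clock_t start = clock();
    (void)run_cpy_blocks(blocks, same_dest);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Built with something like `gcc -O2 -march=armv8-a+sve` on an SVE-capable machine, comparing `time_cpy_blocks(n, 1)` against `time_cpy_blocks(n, 0)` for a large `n` gives a crude signal. A noticeably slower same-destination run would be consistent with the dependency described above, though it is not proof, and the result depends on the core and on loop overhead.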
Impact:
--------
Currently, the instruction is used in code generated for
`VectorStoreMaskNode` and `VectorReinterpretNode`, affecting all Vector
APIs that generate these two IR nodes, such as `VectorMask.intoArray()` and
`Vector.toLong()`. Microbenchmark measurements show performance
degradation ranging from **12%** to **100%** (**2x**), depending on the
specific operation and data types involved.