- Enhancement
- Resolution: Fixed
- P4
- 11
- b27
- aarch64
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8257903 | 11.0.11 | Volker Simonis | P4 | Resolved | Fixed | b01 |
Submitted by Evgeny Astigeevich (eastig@amazon.co.uk)
When UseSIMDForMemoryOps is enabled on Graviton2, arraycopy microbenchmarks regress by 27%-48% for copies of 70-80 bytes. Analysis shows that the problematic code is generated in StubGenerator::copy_memory:
if (UseSIMDForMemoryOps) {
__ ld4(v0, v1, v2, v3, __ T16B, Address(s, 0));
__ ldpq(v4, v5, Address(send, -32));
__ st4(v0, v1, v2, v3, __ T16B, Address(d, 0));
__ stpq(v4, v5, Address(dend, -32));
} else {
Using ldpq/stpq instead of ld4/st4 fixes the regressions. This follows the recommendation in the Arm Optimization Guide, including the one for Neoverse N1: use discrete, non-writeback forms of load and store instructions while interleaving them.
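The copy shape behind this code path can be sketched in portable C++. This is only an illustration, not HotSpot code: `copy_64_to_96` is a hypothetical helper, the 64-96 byte range is an assumption about which lengths reach this branch, and `memcpy` stands in for the paired 128-bit loads and stores. It shows why a 64-byte head plus a 32-byte tail taken from the end of the buffer covers every length in that range, and why plain pair loads suffice: no de-interleaving (ld4) and re-interleaving (st4) is needed to move the bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not part of HotSpot: copies n bytes for 64 <= n <= 96.
// The head covers bytes [0, 64) and the tail covers bytes [n - 32, n).
// The two regions may overlap, which is harmless because the overlapping
// bytes are written with identical values.
static void copy_64_to_96(unsigned char* d, const unsigned char* s, std::size_t n) {
  assert(n >= 64 && n <= 96);

  unsigned char head[64];  // stands in for registers v0..v3
  unsigned char tail[32];  // stands in for registers v4, v5

  // All loads are issued before any store, so the result stays correct
  // even if the source and destination ranges overlap.
  std::memcpy(head, s, 64);           // roughly: ldpq at s + 0 and at s + 32
  std::memcpy(tail, s + n - 32, 32);  // roughly: ldpq at send - 32

  std::memcpy(d, head, 64);           // roughly: stpq at d + 0 and at d + 32
  std::memcpy(d + n - 32, tail, 32);  // roughly: stpq at dend - 32
}
```

Calling `copy_64_to_96(dst, src, 70)` writes exactly bytes 0 through 69 of `dst`. The sketch demonstrates only the correctness of the overlapping head/tail layout; the performance difference reported above depends on the instruction selection in the generated stub.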
- backported by
  - JDK-8257903 AArch64: Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory (Resolved)
- relates to
  - JDK-8257436 AArch64: Regressions in ArrayCopyUnalignedDst.testByte/testChar for 65-78 bytes when UseSIMDForMemoryOps is on (Resolved)
  - JDK-8255351 Add detection for Graviton 2 CPUs (Resolved)