JDK-8256488

AArch64: Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

    • b27
    • aarch64

        Submitted by Evgeny Astigeevich (eastig@amazon.co.uk)

        When UseSIMDForMemoryOps is enabled on Graviton2, arraycopy microbenchmarks regress by 27%-48% for copies of 70-80 bytes. Analysis shows that the problematic code is generated in StubGenerator::copy_memory:

            if (UseSIMDForMemoryOps) {
              __ ld4(v0, v1, v2, v3, __ T16B, Address(s, 0));
              __ ldpq(v4, v5, Address(send, -32));
              __ st4(v0, v1, v2, v3, __ T16B, Address(d, 0));
              __ stpq(v4, v5, Address(dend, -32));
            } else {

        Using ldpq/stpq instead of ld4/st4 fixes the regressions. This follows the recommendation in the Arm optimization guides, including the one for Neoverse N1: use discrete, non-writeback forms of load and store instructions while interleaving them.
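
        The copy pattern behind this code path can be modelled in portable C++. This is a minimal sketch, not the HotSpot stub and not the actual patch: the names Q, load_q, store_q and copy_65_to_80 are hypothetical, and the 65-80 byte range is an assumption based on the copy sizes reported above. Each pair of 16-byte loads stands in for one ldpq and each pair of stores for one stpq. The first 64 bytes are taken from the start of the source and the last 32 bytes from its end, so the two regions overlap for every length in the range, and all loads complete before any store so overlapping source and destination buffers are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical model of one 128-bit SIMD (Q) register.
struct Q { unsigned char b[16]; };

// Plain 16-byte load and store; two adjacent calls model one ldpq / stpq.
static Q load_q(const unsigned char* p) { Q q; std::memcpy(q.b, p, 16); return q; }
static void store_q(unsigned char* p, const Q& q) { std::memcpy(p, q.b, 16); }

// Copies n bytes (assumed 65 <= n <= 80) from s to d using discrete
// pair-style loads and stores instead of a 4-register structure load.
void copy_65_to_80(unsigned char* d, const unsigned char* s, std::size_t n) {
    assert(n >= 65 && n <= 80);
    const unsigned char* send = s + n;  // one past the last source byte
    unsigned char* dend = d + n;        // one past the last destination byte

    // All loads first, so the copy stays correct if d and s overlap.
    Q v0 = load_q(s),         v1 = load_q(s + 16);     // models ldpq at s + 0
    Q v2 = load_q(s + 32),    v3 = load_q(s + 48);     // models ldpq at s + 32
    Q v4 = load_q(send - 32), v5 = load_q(send - 16);  // models ldpq at send - 32

    store_q(d, v0);         store_q(d + 16, v1);       // models stpq at d + 0
    store_q(d + 32, v2);    store_q(d + 48, v3);       // models stpq at d + 32
    store_q(dend - 32, v4); store_q(dend - 16, v5);    // models stpq at dend - 32
}
```

        The head and tail stores write identical bytes wherever they overlap, which is why no length-dependent branching is needed inside the range.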

              simonis Volker Simonis