-
Enhancement
-
Resolution: Unresolved
-
P4
-
26
In https://bugs.openjdk.org/browse/JDK-6912521, an optimization was added to inline small arraycopy invocations as individual loads and stores. These loads and stores are 1 byte each. We could use word-sized loads and stores where appropriate and increase ArrayCopyLoadStoreMaxElem commensurately.
E.g. a 16-byte (aligned) arraycopy could be two inlined loads/stores, instead of calling the runtime stub.
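To illustrate the idea at the Java level, here is a minimal sketch that contrasts sixteen 1-byte loads/stores with two 8-byte loads/stores for the same 16-byte copy. It is not the C2 implementation, which would emit these accesses directly in the compiler's IR. The class and method names (`WordCopySketch`, `copy16ByteWise`, `copy16WordWise`) are hypothetical and exist only for this example.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration only; not taken from the JDK sources.
public class WordCopySketch {
    // Views a byte[] as 8-byte longs. The byte order is irrelevant for a
    // pure copy because the same handle is used for both the get and the set.
    private static final VarHandle LONG_VIEW =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Sixteen 1-byte loads and stores.
    static void copy16ByteWise(byte[] src, int srcPos, byte[] dst, int dstPos) {
        for (int i = 0; i < 16; i++) {
            dst[dstPos + i] = src[srcPos + i];
        }
    }

    // Two 8-byte loads followed by two 8-byte stores. Both loads happen
    // before either store, so overlapping ranges in the same array are
    // still copied correctly.
    static void copy16WordWise(byte[] src, int srcPos, byte[] dst, int dstPos) {
        long lo = (long) LONG_VIEW.get(src, srcPos);
        long hi = (long) LONG_VIEW.get(src, srcPos + 8);
        LONG_VIEW.set(dst, dstPos, lo);
        LONG_VIEW.set(dst, dstPos + 8, hi);
    }

    public static void main(String[] args) {
        byte[] src = new byte[32];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) (i * 7 + 1);
        }
        byte[] a = new byte[32];
        byte[] b = new byte[32];
        copy16ByteWise(src, 8, a, 4);
        copy16WordWise(src, 8, b, 4);
        System.out.println(Arrays.equals(a, b)); // both copies produce the same bytes
    }
}
```

Both methods move the same 16 bytes; the second does so with four memory operations instead of thirty-two.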
A naive implementation gives me a 5% improvement in SPECjvm crypto.signverify, which does a lot of 16-byte copies. UseAVX=3 is faster due to https://bugs.openjdk.org/browse/JDK-8252848.
crypto.signverify on Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz:
UseAVX=3: 209.6 op/s (uses inlined vector masks from JDK-8252848)
UseAVX=2: 198.4 op/s (calls arraycopy runtime stub)
UseAVX=2 ArrayCopyLoadStoreMaxElem=256: 203.3 op/s
UseAVX=2 (with 8-byte inlined load/stores): 209.3 op/s
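As a quick check on these figures, the relative gain of each configuration over the UseAVX=2 baseline (198.4 op/s) can be computed as below. The throughput numbers are the ones listed above; the class and method names are hypothetical. With 8-byte inlined load/stores the gain works out to roughly 5.5%, in line with the 5% quoted.

```java
import java.util.Locale;

// Hypothetical helper for reading the benchmark numbers above.
public class SpeedupSketch {
    // Percentage gain of `measured` over `baseline` throughput (op/s).
    static double percentGain(double baseline, double measured) {
        return (measured / baseline - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double baseline = 198.4; // UseAVX=2, calls the arraycopy runtime stub
        String[] labels = {
            "UseAVX=3",
            "UseAVX=2 ArrayCopyLoadStoreMaxElem=256",
            "UseAVX=2 (with 8-byte inlined load/stores)"
        };
        double[] opsPerSec = {209.6, 203.3, 209.3};
        for (int i = 0; i < labels.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%s: +%.1f%%",
                labels[i], percentGain(baseline, opsPerSec[i])));
        }
    }
}
```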
- relates to
-
JDK-6912521 System.arraycopy works slower than the simple loop for little lengths
-
- Resolved
-