Type: Enhancement
Resolution: Fixed
Priority: P4
Affects Version/s: 11, 17, 18, 19
Resolved In Build: b05
CPU: x86_64
In copy_bytes_forward and copy_bytes_backward, which are used in arraycopy stubs, we have code like:
if (UseAVX >= 2) {
  // clean upper bits of YMM registers
  __ vpxor(xmm0, xmm0);
  __ vpxor(xmm1, xmm1);
}
This code was added by JDK-8011102 (originally as vzeroupper), and then changed by JDK-8078113 (vzeroupper replaced with vpxor).
It raised some questions during the early JDK-8279621 review.
I believe these were added to resolve false dependencies of subsequent 128-bit-using instructions on earlier writes to the full 256-bit registers.
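A hedged sketch of that hazard, with register choices and offsets that are mine rather than lifted from the stubs: a legacy-SSE instruction preserves bits 255:128 of its destination register, so after a 256-bit write it carries a false dependency on that write; the vpxor zeroing idiom is recognized at register rename and severs the chain:

__ vmovdqu(xmm0, Address(from, 0));   // VEX.256 load: all 256 bits of ymm0 become live
__ vpxor(xmm0, xmm0);                 // zeroing idiom: ymm0 no longer depends on the wide write
__ movdqu(xmm0, Address(from, 16));   // this legacy-SSE load would otherwise have to wait
                                      // for the 256-bit producer above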
Note: this is still insufficient on Intel x86 implementations to recover from a "dirty" AVX state; only vzeroupper/vzeroall would solve that. But that issue might not even affect our assembler code, which AFAICS uses VEX-encoded versions when AVX > 0 (see Assembler::simd_prefix_and_encode). Every arraycopy stub has vzeroupper at the end, anyhow.
For the x86_64 version, this zeroing seems redundant: there are no XMM-using instructions after we leave copy_bytes_{forward,backward} and go to the stub epilogue, where we meet vzeroupper.
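For reference, a minimal sketch of the epilogue shape being relied on here (register restores and frame teardown elided; not a verbatim copy of the stub generator):

__ vzeroupper();   // resets the entire AVX upper state, which vpxor alone cannot do
// ... restore callee-saved registers, tear down the stub frame ...
__ ret(0);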
For the x86_32 version, this zeroing seems odd. x86_32 qword copying still uses XMM registers, as the 32-bit platform has no other good way to copy 8 bytes at a time. There, using VEX.256 vpxor clears all bits, which is fine right now, but JDK-8279621 changes would probably need to clear the upper 128 bits in AVX=1 mode. Also, it is only enabled for AVX == 2, ignoring AVX-512.
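Two hedged sketches for the 32-bit points above (operands are illustrative, not from the stubs). First, the qword copy that keeps XMM registers live on x86_32, since 32-bit GPRs move at most 4 bytes at a time; second, one possible shape of the AVX=1 clearing, relying on the fact that any VEX.128-encoded instruction zeroes bits 255:128 of its destination:

__ movq(xmm0, Address(from, 0));   // load 8 bytes into the low half of xmm0
__ movq(Address(to, 0), xmm0);     // store them back; only xmm0[63:0] is used

__ vpxor(xmm0, xmm0, xmm0, Assembler::AVX_128bit);  // VEX.128 encoding is legal with plain AVX
                                                    // and zeroes all of ymm0, upper half included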
Draft PR:
https://github.com/openjdk/jdk/pull/7016
blocks:
  JDK-8279621: x86_64 arraycopy stubs should use 256-bit copies with AVX=1 (Closed)
relates to:
  JDK-8178811: Minimize the AVX <-> SSE transition penalty through generation of vzeroupper instruction on x86 (Resolved)
  JDK-8078113: 8011102 changes may cause incorrect results (Resolved)
  JDK-8011102: Clear AVX registers after return from JNI call (Resolved)