- Bug
- Resolution: Duplicate
- P3
- 22
- x86
After the integration of JDK-8308966, which intrinsifies float/double modulo, there is a performance regression on x86 systems without AVX512. It can be isolated by running Blender.java with flags that disable most C2 optimizations and compile only test(). The same regression is visible with the interpreter only (-Xint). On AVX512 there is a smaller regression of 2-3%, which might also be worth looking into. Measurements:
Test on AVX2
========
Setup:
- AVX512 not available
- AVX2 available
- FMA instructions available
- fastdebug build
--- JDK 22+7/mainline ---
Default:
$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
Output:
2847 ms
2860 ms
2876 ms
2861 ms
2868 ms
2867 ms
2877 ms
2875 ms
2880 ms
2880 ms
Average: 2869 ms
Disabling FMA instructions with -XX:-UseFMA:
$ java -XX:-UseFMA -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
Output:
329 ms
329 ms
330 ms
330 ms
331 ms
331 ms
332 ms
330 ms
332 ms
332 ms
Average: 330 ms
--- JDK 21+31 ---
$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
Output:
341 ms
340 ms
341 ms
341 ms
340 ms
341 ms
340 ms
341 ms
341 ms
341 ms
Average: 340 ms
-----> SUMMARY: ~8.4x regression in JDK 22 for AVX2 without AVX512 (2869 ms vs. 340 ms); -XX:-UseFMA removes it
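Blender.java itself is not attached to this description. As a rough illustration only, a loop of the following shape exercises the same frem/drem path; the class name FremSketch, the constants and the iteration count are assumptions made for this sketch and are not taken from the original benchmark.

```java
// Hypothetical stand-in for Blender.java; NOT the original benchmark.
public class FremSketch {
    // Hot method: the float and double '%' operators compile to the
    // frem/drem bytecodes, the operation JDK-8308966 intrinsified on x86.
    static double test(int iterations) {
        float f = 0.0f;
        double d = 0.0;
        for (int i = 1; i <= iterations; i++) {
            f += (i * 1.5f) % 7.25f; // float modulo (frem)
            d += (i * 1.5) % 7.25;   // double modulo (drem)
        }
        return f + d;
    }

    public static void main(String[] args) {
        final int runs = 10;
        long sumMs = 0;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            double result = test(2_000_000);
            long ms = (System.nanoTime() - start) / 1_000_000;
            sumMs += ms;
            System.out.println(ms + " ms");
            // Use the result so the loop cannot be treated as dead code.
            if (Double.isNaN(result)) {
                throw new AssertionError("unexpected NaN");
            }
        }
        // Integer division, matching the truncated averages in the logs above.
        System.out.println("Average: " + (sumMs / runs) + " ms");
    }
}
```

With this sketch, the CompileCommand filter would be compileonly,FremSketch::test instead of Blender::test; the absolute timings will not match the numbers reported here.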
=== Interpreter only ===
--- JDK 22+7/mainline ---
Default:
$ java -Xint Blender.java
Output:
3311 ms
3310 ms
3314 ms
3324 ms
3320 ms
3333 ms
3343 ms
3350 ms
3343 ms
3336 ms
Average: 3328 ms
Disabling FMA instructions with -XX:-UseFMA:
$ java -XX:-UseFMA -Xint Blender.java
Output:
956 ms
877 ms
865 ms
886 ms
897 ms
917 ms
886 ms
876 ms
863 ms
903 ms
Average: 892 ms
--- JDK 21+31 ---
$ java -Xint Blender.java
Output:
917 ms
930 ms
951 ms
973 ms
941 ms
926 ms
948 ms
963 ms
971 ms
975 ms
Average: 949 ms
-----> SUMMARY: ~3.5x regression in JDK 22 for AVX2 without AVX512 with interpreter only (3328 ms vs. 949 ms); -XX:-UseFMA removes it
Test on AVX512
=========
Setup:
- AVX512 available (VM_Version::supports_avx512vlbwdq() is true)
- fastdebug build
--- JDK 22+7/mainline ---
Default:
$ java -XX:-TieredCompilation -XX:LoopMaxUnroll=0 -XX:-DoEscapeAnalysis -XX:+UseParallelGC -XX:CompileCommand=compileonly,Blender::test Blender.java
Output:
907 ms
907 ms
908 ms
907 ms
917 ms
908 ms
908 ms
908 ms
910 ms
907 ms
Average: 908 ms
--- JDK 21+31 ---
888 ms
884 ms
884 ms
884 ms
885 ms
884 ms
888 ms
890 ms
884 ms
884 ms
Average: 885 ms
-----> SUMMARY: ~2-3% regression in JDK 22 for AVX512
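The slowdown factors can be re-derived from the reported averages by simple division. The snippet below is only a convenience check; the class and method names are made up for this sketch, and the inputs are the "Average" lines from the logs above.

```java
import java.util.Locale;

// Re-derives the slowdown factors from the averages reported above.
public class RegressionRatios {
    // Ratio of the JDK 22 time to the JDK 21 baseline (> 1 means slower).
    static double slowdown(double jdk22Ms, double jdk21Ms) {
        return jdk22Ms / jdk21Ms;
    }

    public static void main(String[] args) {
        // AVX2, C2 with the flags from the report: 2869 ms vs. 340 ms
        System.out.printf(Locale.ROOT, "AVX2 C2:         %.2fx%n", slowdown(2869, 340)); // 8.44x
        // AVX2, interpreter only: 3328 ms vs. 949 ms
        System.out.printf(Locale.ROOT, "AVX2 -Xint:      %.2fx%n", slowdown(3328, 949)); // 3.51x
        // AVX2, C2 with -XX:-UseFMA: 330 ms vs. 340 ms (no regression)
        System.out.printf(Locale.ROOT, "AVX2 C2 -UseFMA: %.2fx%n", slowdown(330, 340));  // 0.97x
        // AVX512, C2: 908 ms vs. 885 ms, expressed as a percentage
        System.out.printf(Locale.ROOT, "AVX512 C2:       +%.1f%%%n",
                (slowdown(908, 885) - 1.0) * 100.0);                                     // +2.6%
    }
}
```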
duplicates
- JDK-8314056 Remove runtime platform check from frem/drem (Resolved)
relates to
- JDK-8312188 Performance regression in SharedRuntime::frem/drem() on non-Windows x86 after JDK-8302191 (Closed)
- JDK-8308966 Add intrinsic for float/double modulo for x86 AVX2 and AVX512 (Resolved)