- Enhancement
- Resolution: Fixed
- P4
- 8, 11, 17, 20, 21
- b16
- x86_64
- linux
- Verified
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8307047 | 20u-cpu | Christoph Langer | P4 | Resolved | Fixed | master |
JDK-8305948 | 20.0.2 | Christoph Langer | P4 | Resolved | Fixed | b03 |
Float and double modulo operations in Java on Linux suffered a performance degradation (about 6x slowdown).
It was introduced by a change in GCC between gcc-4.8 and gcc-4.9 and went unnoticed.
It is therefore easy to demonstrate by comparing the performance of two jdk8 builds produced by the two versions of the GCC compiler.
The affected native hotspot code is unchanged to this day. Applying the jdk8 fix to the trunk (jdk 21) reproduces both the problem and the solution with all recent versions of gcc.
gcc has generated the slow code since this commit (performance regression):
[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
= https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=4f2611b6e872c40e0bf4da38ff05df8c8fe0ee64
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 (backport)
The performance regression got fixed/reverted by this commit:
[PATCH] i386: Do not constrain fmod and remainder patterns with flag_finite_math_only [PR108922]
= https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612918.html
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=8020c9c42349f51f75239b9d35a2be41848a97bd
Attached are:
* the reproducer org.apache.spark.DivisionDemo.java;
* jdk8 timings with gcc-4.8 builds before/after the remainder change in gcc;
* jdk8 timings with gcc-4.8 after the change in gcc, with the fix in hotspot;
* timings for jdk 21 with gcc-12 before/after the fix in hotspot;
* the fix, applicable to all versions of jdk (with a path adjustment for jdk8).
The reproducer should be run as
java -cp . -Xmx1024m -Xms1024m -XX:+AlwaysPreTouch org.apache.spark.DivisionDemo 10 f
where the last parameter is f for float or d for double.
Analysis:
Java modulo (%) is compiled into the Java bytecode drem (frem for float), which is defined like C fmod(), not like C drem() (also known as remainder()). For the C/C++ function fmod():
* gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only if the result was non-finite.
* gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.
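The difference between the two C functions can be seen from the Java side. The following is a minimal sketch, not part of the attached reproducer; the class and method names are hypothetical. It contrasts the % operator (truncating, like fmod()) with Math.IEEEremainder (round-to-nearest, like remainder()/drem()).

```java
public class RemainderSemantics {
    // Truncating remainder: what the % operator computes.
    // The result takes the sign of the dividend, as with C fmod().
    public static double truncating(double a, double b) {
        return a % b;
    }

    // Round-to-nearest remainder, as with C remainder()/drem().
    // The result may have the opposite sign of the dividend.
    public static double ieee(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        System.out.println(truncating(5.5, 2.0));   // 1.5  (5.5 - 2 * 2.0)
        System.out.println(ieee(5.5, 2.0));         // -0.5 (5.5 - 3 * 2.0)
        System.out.println(truncating(-5.5, 2.0));  // -1.5
        System.out.println(ieee(-5.5, 2.0));        // 0.5
    }
}
```

Only the first form is what drem/frem require, which is why the hotspot code has to provide fmod() semantics rather than remainder() semantics.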
According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use it directly:
* https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
* https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483
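The special cases that the JVM specification lists for drem can be checked with a short Java sketch. This is only an illustration of the Java-side contract that any native implementation has to preserve; it does not exercise fprem itself, and the class and method names are hypothetical.

```java
public class DremSpecialCases {
    // Routing the operands through a method keeps javac from folding
    // the expression into a compile-time constant.
    public static double rem(double a, double b) {
        return a % b;
    }

    public static void main(String[] args) {
        double inf = Double.POSITIVE_INFINITY;

        // A NaN operand, an infinite dividend, or a zero divisor gives NaN.
        System.out.println(Double.isNaN(rem(Double.NaN, 2.0))); // true
        System.out.println(Double.isNaN(rem(inf, 2.0)));        // true
        System.out.println(Double.isNaN(rem(1.0, 0.0)));        // true

        // A finite dividend with an infinite divisor gives the dividend.
        System.out.println(rem(5.0, inf));                      // 5.0

        // A zero dividend with a finite divisor gives the dividend,
        // including its sign.
        System.out.println(rem(-0.0, 3.0));                     // -0.0
    }
}
```

No exception is thrown in any of these cases, including division by zero.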
The following 3 issues are relevant for upstream Linux components but do not need to be resolved for OpenJDK:
* The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
* gcc could also use the fprem instruction instead of calling glibc fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
* clang has no fprem instruction optimization; it only calls glibc fmod().
The patch fixes the performance and applies to both OpenJDK-8 and OpenJDK trunk (and, I expect, anything in between). I see no regression on OpenJDK-8 Linux x86_64.
It is hard to write a regression test for a performance fix, hence noreg-perf.
backported by
* JDK-8305948 Performance degradation for float/double modulo on Linux (Resolved)
* JDK-8307047 Performance degradation for float/double modulo on Linux (Resolved)
duplicates
* JDK-8302524 Performance regression for float/double modulo operation (Closed)
relates to
* JDK-8305689 Consider adding an intrinsic for StrictMath.IEEEremainder (Open)
* JDK-8308966 Add intrinsic for float/double modulo for x86 AVX2 and AVX512 (Resolved)
* JDK-8312188 Performance regression in SharedRuntime::frem/drem() on non-Windows x86 after JDK-8302191 (Closed)
* JDK-8314056 Remove runtime platform check from frem/drem (Resolved)
links to
* Commit openjdk/jdk20u/e1746816
* Commit openjdk/jdk/37774556
* Review openjdk/jdk8u-dev/298
* Review openjdk/jdk11u-dev/1824
* Review openjdk/jdk17u-dev/1234
* Review openjdk/jdk19u/108
* Review openjdk/jdk20u/46
* Review openjdk/jdk/12508