
Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version/s: 21
Affects Version/s: 8, 11, 17, 20, 21
Component/s: hotspot
Labels:

Subcomponent: runtime
Resolved In Build: b16
CPU: x86_64
OS: linux
Verification: Verified

Issue	Fix Version	Assignee	Priority	Status	Resolution	Resolved In Build
JDK-8307047	20u-cpu	Christoph Langer	P4	Resolved	Fixed	master
JDK-8305948	20.0.2	Christoph Langer	P4	Resolved	Fixed	b03

As reported by Jan Kratochvil:

Float/double modulo operations in Java on Linux suffered a performance degradation (about a 6x slowdown).
It was introduced by a change in GCC between gcc-4.8 and gcc-4.9 and went unnoticed.
It is therefore easy to observe by comparing two builds of jdk8 made with different versions of the GCC compiler.

The affected native hotspot code is still the same today. Applying the same fix as in jdk8 to the trunk (jdk 21) shows the problem (and the solution) with all recent versions of gcc.

gcc has been slow since this commit (performance regression):
[PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
= https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=4f2611b6e872c40e0bf4da38ff05df8c8fe0ee64
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098 (backport)

The performance regression got fixed/reverted by this commit:
[PATCH] i386: Do not constrain fmod and remainder patterns with flag_finite_math_only [PR108922]
= https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612918.html
https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=8020c9c42349f51f75239b9d35a2be41848a97bd

Attached are:
* the reproducer org.apache.spark.DivisionDemo.java;
* jdk8 timings with gcc-4.8 builds before/after the remainder change in gcc;
* jdk8 timings with gcc-4.8 after the change in gcc, with the fix in hotspot;
* timings for jdk 21 with gcc-12 before/after the fix in hotspot.

The fix is applicable to all versions of jdk (with a path adjustment for jdk8).

The reproducer should be run as:
java -cp . -Xmx1024m -Xms1024m -XX:+AlwaysPreTouch org.apache.spark.DivisionDemo 10 f
where the last parameter is f for float or d for double.
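
For readers without access to the attachment, the kind of measurement involved can be sketched as follows. This is not the attached DivisionDemo.java: the class name ModuloTiming, its methods, the constants and the default iteration count are all hypothetical, and whether a given JVM routes these operations through the affected runtime code depends on the build and on the compilation tier.

```java
// Hypothetical stand-in for a modulo timing loop; NOT the attached reproducer.
public class ModuloTiming {
    // Accumulates double remainders so the loop has an observable result.
    static double sumDoubleRemainders(int n) {
        double acc = 0.0;
        for (int i = 1; i <= n; i++) {
            acc += (i * 1.5) % 7.25;    // double % double
        }
        return acc;
    }

    // Same loop for float operands.
    static float sumFloatRemainders(int n) {
        float acc = 0.0f;
        for (int i = 1; i <= n; i++) {
            acc += (i * 1.5f) % 7.25f;  // float % float
        }
        return acc;
    }

    public static void main(String[] args) {
        // First argument: f for float, anything else (or nothing) for double.
        boolean useFloat = args.length > 0 && args[0].equals("f");
        // Second argument (optional): iteration count; the default is an arbitrary choice.
        int n = args.length > 1 ? Integer.parseInt(args[1]) : 20_000_000;

        long start = System.nanoTime();
        double result = useFloat ? sumFloatRemainders(n) : sumDoubleRemainders(n);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println((useFloat ? "float" : "double") + " modulo: "
                + elapsedMs + " ms, checksum " + result);
    }
}
```

Running the same class on two builds of the same JDK source, compiled with different gcc versions, would be one way to compare timings; a single run like this is only a rough indication, not a rigorous benchmark.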

Analysis:

Java modulo (%) is compiled into the Java bytecode drem, which is defined like C fmod(), not like C drem() (also known as remainder()). So for the C/C++ function fmod:
* gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only if the result was non-finite;
* gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.
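
The difference between the two remainder flavors mentioned above can be seen from Java itself, where % has the truncating (fmod-like) behavior and Math.IEEEremainder has the round-to-nearest (C remainder()-like) behavior. The class and method names below are made up for illustration.

```java
// Illustration only: contrasts Java % with Math.IEEEremainder.
public class RemainderSemantics {
    // Java % on doubles: quotient truncated toward zero, result takes the sign of the dividend.
    static double truncatingRemainder(double a, double b) {
        return a % b;
    }

    // IEEE 754 remainder: quotient rounded to the nearest integer.
    static double nearestRemainder(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        System.out.println(truncatingRemainder(5.5, 2.0));   // 1.5  (5.5 - 2*2.0)
        System.out.println(nearestRemainder(5.5, 2.0));      // -0.5 (5.5 - 3*2.0)
        System.out.println(truncatingRemainder(-5.5, 2.0));  // -1.5

        // Non-finite and zero operands:
        System.out.println(truncatingRemainder(5.5, 0.0));                       // NaN
        System.out.println(truncatingRemainder(Double.POSITIVE_INFINITY, 2.0));  // NaN
        System.out.println(truncatingRemainder(5.5, Double.POSITIVE_INFINITY));  // 5.5
    }
}
```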

According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use it directly:
* https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
* https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483
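
The drem rule in the JVMS (the result is value1 - value2 * q, with q truncated toward zero) can be cross-checked from Java against exact decimal arithmetic. The following is a minimal sketch of such a check, assuming finite operands and a nonzero divisor; the class DremCrossCheck and its methods are hypothetical and are not part of the patch or of any OpenJDK test.

```java
import java.math.BigDecimal;

// Hypothetical cross-check of Java % against an exact reference computation.
public class DremCrossCheck {
    // Exact truncated remainder via BigDecimal.
    // Assumes a is finite and b is finite and nonzero.
    static double exactRemainder(double a, double b) {
        return new BigDecimal(a).remainder(new BigDecimal(b)).doubleValue();
    }

    // True when the JVM's % agrees with the exact reference.
    // The == comparison deliberately ignores the sign of a zero result.
    static boolean agrees(double a, double b) {
        return (a % b) == exactRemainder(a, b);
    }

    public static void main(String[] args) {
        double[][] cases = {
            {5.5, 2.0}, {-7.75, 0.5}, {1e300, 3.0}, {0.1, 0.03}, {123456.789, -0.001}
        };
        for (double[] c : cases) {
            System.out.println(c[0] + " % " + c[1] + " agrees: " + agrees(c[0], c[1]));
        }
    }
}
```

BigDecimal is used here because the truncated remainder of two doubles is itself exactly representable as a double, so an exact reference can be compared with == rather than with a tolerance.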

The following 3 issues are relevant to upstream Linux components, but resolving them is not required for OpenJDK:
* The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
* gcc could also use the fprem instruction instead of the glibc call fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
* clang does not have any fprem instruction optimization; it only calls glibc fmod().

The patch fixes the performance and applies to both OpenJDK-8 and the OpenJDK trunk (and, I expect, anything in between). I see no regression on OpenJDK-8 Linux x86_64.

It is hard to detect a regression with a performance fix, so noreg-perf.


Attachment	Size	Date
DivisionDemo.java	5 kB	2023-02-09 23:40
time.openjdk21-gcc1221	0.9 kB	2023-02-09 23:40
time.openjdk21-gcc1221fix	0.9 kB	2023-02-09 23:40
time.openjdk8-gcc48post	1.0 kB	2023-02-09 23:40
time.openjdk8-gcc48postfix	1.0 kB	2023-02-09 23:40
time.openjdk8-gcc48pre	1.0 kB	2023-02-09 23:40

backported by
JDK-8305948	Performance degradation for float/double modulo on Linux	Resolved
JDK-8307047	Performance degradation for float/double modulo on Linux	Resolved

causes
JDK-8349401	Performance regression in JDK 21 for double arithmetic operations	Closed
JDK-8308966	Add intrinsic for float/double modulo for x86 AVX2 and AVX512	Resolved
JDK-8314056	Remove runtime platform check from frem/drem	Resolved

duplicates
JDK-8302524	Performance regression for float/double modulo operation	Closed

relates to
JDK-8305689	Consider adding an intrinsic for StrictMath.IEEEremainder	Open
JDK-8308966	Add intrinsic for float/double modulo for x86 AVX2 and AVX512	Resolved
JDK-8312188	Performance regression in SharedRuntime::frem/drem() on non-Windows x86 after JDK-8302191	Closed

links to
Commit openjdk/jdk20u/e1746816
Commit openjdk/jdk/37774556
Review openjdk/jdk8u-dev/298
Review openjdk/jdk11u-dev/1824
Review openjdk/jdk17u-dev/1234
Review openjdk/jdk19u/108
Review openjdk/jdk20u/46
Review openjdk/jdk/12508
