Type: Enhancement
Resolution: Won't Fix
Priority: P4
Fix Version: None
Affects Versions: 1.4.2, 5.0, 6
CPU: x86
OS: windows_2000, windows_xp
Name: rmT116609 Date: 03/01/2004
A DESCRIPTION OF THE REQUEST :
The transcendental functions in java.lang.Math, such as the trigonometric
and exponential methods, have dismal performance compared to similar
functions from C/C++ libraries (or virtually any compiled language).
Java's performance is limited by its stringent IEEE-adherence
requirements, but these should impose only a modest overhead over other
languages. Investigating the SCSL sources, it's easy to see that the root of
all the slowness lies in the fdlibm native library, which implements the
native methods of java.lang.StrictMath with an approach comparable
to emulated FP; that is, the library doesn't exploit the FPU's
transcendental opcodes, not even for the "kernel" of the calculation that the FPU could do.
For example, java.lang.StrictMath.tan() calls a JNI stub, which calls
s_tan.c:tan(), which calls k_tan.c:__kernel_tan(). These two C functions
implement the tangent with a complicated algorithm that never uses
the FPU's tangent opcode (e.g., FPTAN on Intel chips). This doesn't make
the algorithm FPU-free, because the C compiler still generates the
architecture-specific opcodes for the fundamental operations used in fdlibm,
like +, -, * and /; only the critical FPTAN instruction is avoided, yet that is the
one instruction that could replace 90% of the C code and run an order
of magnitude faster than that code. In fact, stepping through a C/C++
compiler's equivalent tan() library function in a debugger, it's possible to see that FPTAN
does all the hard work (with a few extra instructions to load arguments or
handle special cases like infinities).
This approach in fdlibm is probably the only way to implement "strict"
maths, so that the results are bit-for-bit identical on all platforms; the
microcode in FPUs often uses simpler algorithms than those in fdlibm,
so it may produce less precise results or fail to meet all the requirements
of the Java spec, such as semi-monotonicity.
Unfortunately, all maths are strict in Java, because the transcendental
methods in java.lang.Math are all stubs that invoke the same methods in
java.lang.StrictMath. I can understand the tradeoff in StrictMath, but
in the standard ("loose") Math class, we should favor the best possible performance.
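For concreteness, the delegation looks like this (paraphrased from the
1.4-era sources; treat it as illustrative rather than a verbatim quote):

public static double tan(double a) {
    // java.lang.Math.tan is a thin stub: all the work, and all the cost,
    // lives in the native StrictMath/fdlibm implementation.
    return StrictMath.tan(a);
}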
Notice that most of the semantics dictated by the specification of these functions
(i.e., support for all the special values) could still be implemented with
checks performed before invoking the FPU instruction. It's possible
[I'm not a Math expert] that 1-ulp precision and semi-monotonicity cannot
easily be implemented (in the general case (*)) around the FPU opcode;
but if that is a problem, the spec should be fixed: these requirements
clearly exceed the needs of loose maths.
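As a sketch of the "checks before the FPU instruction" idea, the special
values can be screened in plain Java before dispatching to a fast kernel.
fastTan() below stands in for a hypothetical hardware-backed (e.g., FPTAN)
intrinsic, stubbed with StrictMath.tan so the example compiles; the names
are mine, not the JDK's:

public final class LooseTan {
    public static double tan(double a) {
        // Spec-mandated special cases, handled before the fast path:
        if (Double.isNaN(a) || Double.isInfinite(a)) {
            return Double.NaN;  // NaN or infinite argument -> NaN
        }
        if (a == 0.0) {
            return a;           // a zero result keeps the argument's sign
        }
        return fastTan(a);
    }

    // Placeholder for the hypothetical FPU-backed kernel.
    private static double fastTan(double a) {
        return StrictMath.tan(a);
    }
}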
In a similar fashion, the methods min() and max() should be relieved
of the demands of supporting NaN and signed zeros; because of these,
Java's min()/max() are a lot slower than in most other languages (where
similar functions are defined trivially, as "x >= y ? x : y"). In fact, most
Java developers are completely unaware of the overhead of min()/max(), because such
methods are so common in all languages that a newcomer to Java would
hardly bother to read the docs or the sources (what could be simpler than
min and max?). But min/max can be critical in some tight loops (e.g., DSP).
Once again, I think java.lang.StrictMath is okay (it's great to have a
max() that reports that +0 is bigger than -0, if I ever need this); but in
java.lang.Math, the spec should enable the trivial implementation. The
sketch below contrasts the two.
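A minimal sketch of that contrast; strictMax mirrors the behavior that
java.lang.Math.max(double, double) is specified to have, and is not the
actual JDK source:

public final class MaxDemo {
    // The trivial version: one compare, one branch.
    static double trivialMax(double x, double y) {
        return x >= y ? x : y;
    }

    // Spec-compliant version: NaN propagates, and +0.0 beats -0.0.
    static double strictMax(double x, double y) {
        if (x != x) return x;  // x is NaN
        if (x == 0.0 && y == 0.0
                && Double.doubleToLongBits(x) == Double.doubleToLongBits(-0.0)) {
            return y;          // distinguishes -0.0 from +0.0, which == cannot do
        }
        return x >= y ? x : y;
    }

    public static void main(String[] args) {
        System.out.println(trivialMax(-0.0, 0.0)); // -0.0: sign of zero ignored
        System.out.println(strictMax(-0.0, 0.0));  //  0.0: +0 is "bigger" than -0
    }
}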
(*) When the FPU can calculate with extended precision, e.g. on the
Pentium (64-bit arithmetic for floats, 80-bit arithmetic for doubles),
it's highly likely that the results of the transcendental opcodes will be exact at
least up to the last bit used by the Java type (32 bits for float, 64 for double).
This means that the current requirements of precision and semi-monotonicity
would certainly be satisfied, and only bit-for-bit reproducibility of results
might be lost; so this optimization could be performed without any spec
changes, at least on architectures supporting extended-precision FP.
Even on platforms not supporting FP numbers larger than 64 bits, the
optimization could be done at the very least for floats.
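The float case is expressible in pure Java today: evaluate in the wider
double format and round once, so the 64-bit kernel's error stays far below
1 ulp of the 32-bit result. A minimal sketch (tanFloat is my name, not a
JDK method):

public final class FloatTan {
    // Evaluate in the wider format, then round once to float.
    static float tanFloat(float a) {
        return (float) StrictMath.tan(a);
    }

    public static void main(String[] args) {
        System.out.println(tanFloat(1.0f)); // ~1.5574077
    }
}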
JUSTIFICATION :
Many years have gone by since the introduction of StrictMath/strictfp,
so the people who really care about such wizardry as signed zeros or
results reproducible across platforms are either using StrictMath explicitly,
or deserve any problems they may have from a relaxation of the loose
methods. On the other hand, the people who don't need those advanced
features but do need maximum FP performance currently have no choice
(except for custom math libs, and most of those would require custom JNI code).
If there is a risk that further relaxing java.lang.Math could break some
apps (which *shouldn't* be using this class in the first place), that is certainly
a lesser problem than the severe hit currently imposed on those who want
to write scientific and engineering apps, or advanced 3D games, in Java.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Performance should be similar to C/C++; for example, benchmarking tan()
shows a performance of 12 ns per call (VC++ 2003, Pentium IV 2.25 GHz).
ACTUAL -
The Java version of this benchmark shows 148 ns per call to tan() (with either
java.lang.Math or java.lang.StrictMath). That is 12x slower (HotSpot 1.4.2, -server -Xbatch). The same problem happens with several other methods in java.lang.Math.
---------- BEGIN SOURCE ----------
Java benchmark:
public class Tan {
    static void benchLoose() {
        long t = System.currentTimeMillis();
        for (double i = 0; i < 10000000.0; ++i) Math.tan(i);
        // 1e7 calls: elapsed ms * 1e6 ns/ms / 1e7 calls = elapsed / 10 ns per call.
        System.out.println("Loose: " + (System.currentTimeMillis() - t) / 10 + "ns/call");
    }

    static strictfp void benchStrict() {
        long t = System.currentTimeMillis();
        for (double i = 0; i < 10000000.0; ++i) StrictMath.tan(i);
        System.out.println("Strict: " + (System.currentTimeMillis() - t) / 10 + "ns/call");
    }

    public static void main(String[] args) {
        while (true) {
            benchLoose();
            benchStrict();
        }
    }
}
C++ benchmark:
#include <cmath>
#include <cstdio>
#include <ctime>

void bench() {
    clock_t t = clock();
    for (double i = 0; i < 10000000.0; ++i) tan(i);
    // Convert clock ticks to ms, then to ns per call over 1e7 calls.
    double ms = (clock() - t) * 1000.0 / CLOCKS_PER_SEC;
    printf("C++: %fns/call\n", ms / 10.0);
}

int main() {
    while (true)
        bench();
}
Note: even though both are microbenchmarks and I tested with full optimization,
I have analysed the generated code to be sure that tan() is actually being called.
---------- END SOURCE ----------
(Incident Review ID: 234064)
======================================================================
- relates to
JDK-5004907 StrictMath.log() needs to be intrinsified (Resolved)
JDK-4857011 Performance regression in trigonometic functions (very slow StrictMath) (Closed)
JDK-5108893 Math.abs() is slow (Resolved)
JDK-4807358 REGRESSION: Performance regression in trigonometric functions (Closed)