JDK-5005861

native libraries for java.lang.Math should be non-strict


    • CPU: x86
    • OS: windows_2000, windows_xp

      Name: rmT116609 Date: 03/01/2004


      A DESCRIPTION OF THE REQUEST :
      The transcendental functions in java.lang.Math, like the trigonometric and
      exponential methods, have dismal performance compared to similar
      functions from C/C++ libraries (or virtually any compiled language).
      Java's performance is limited by the stringent precision and
      reproducibility requirements of the spec, but these should impose only a
      modest overhead over other languages. Investigating the SCSL sources, it's
      easy to see that the root of all slowness lies in the fdlibm native
      library, which implements the native methods of java.lang.StrictMath with
      an approach comparable to emulated FP; that is, the library doesn't
      exploit the transcendental opcodes of FPUs, not even for the "kernel" of
      the calculation that an FPU could do.

      For example, java.lang.StrictMath.tan() calls a JNI stub, which calls
      s_tan.c:tan(), which calls k_tan.c:__kernel_tan(). These two C functions
      implement the tangent with a complicated algorithm that never uses the
      FPU's tangent opcode (e.g., FPTAN on Intel chips). This doesn't make
      the algorithm FPU-free, because the C compiler will generate the
      architecture-specific opcodes for the fundamental operations used in
      fdlibm, like +, -, *, /; it's only the critical FPTAN instruction that's
      avoided, but that's the one instruction that could replace 90% of the C
      code -- and run an order of magnitude faster than that code. In fact,
      stepping through a C/C++ compiler's tan() library function in a debugger,
      it's possible to see that FPTAN does all the hard work (with a few extra
      instructions to load arguments or handle special cases like infinities).

      This approach in fdlibm is probably the only way to implement "strict"
      math, so the results are bit-for-bit identical on all platforms; the
      microcode in FPUs often uses simpler algorithms than those in fdlibm,
      so it may produce less precise results or fail to satisfy requirements of
      the Java spec such as semi-monotonicity (whenever the mathematical
      function is non-decreasing, the floating-point approximation must be too).

      Unfortunately, all math is strict in Java, because the transcendental
      methods in java.lang.Math are all stubs that invoke the same methods in
      java.lang.StrictMath. I can understand the tradeoff in StrictMath, but
      the standard ("loose") Math class should favor performance.
      Notice that most semantics dictated by the specification of these
      functions (e.g., support for all special values) could still be
      implemented with checks performed before invoking the FPU instruction;
      see the sketch after this paragraph. It's possible [I'm not a math
      expert] that 1-ulp precision and semi-monotonicity cannot be implemented
      easily (in the general case (*)) around the FPU opcode; but if this is a
      problem, the spec should be fixed: these requirements clearly exceed the
      needs of loose math.
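
      As a rough illustration only: the sketch below performs the special-value
      checks mandated by the Math.tan() spec in Java, and delegates every other
      finite argument to a fast platform primitive. fastTan is a hypothetical
      native method (it does not exist in the JDK); it stands in for whatever
      FPU-backed kernel the runtime could supply.

      public final class LooseMath {
       // Hypothetical FPU-backed tangent; NOT an existing JDK method.
       private static native double fastTan(double a);

       public static double tan(double a) {
        // Special values required by the Math.tan() spec:
        // tan(NaN) and tan(+/-infinity) are NaN; tan(+/-0.0) is +/-0.0.
        if (Double.isNaN(a) || Double.isInfinite(a)) return Double.NaN;
        if (a == 0.0) return a;  // preserves the sign of zero
        return fastTan(a);       // fast path for every other finite argument
       }
      }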

      In a similar fashion, the methods min() and max() should be relieved of
      the requirement to handle NaN and signed zeros; because of it,
      Java's min()/max() are a lot slower than in most other languages (where
      similar functions are defined trivially, as "x >= y ? x : y"). In fact,
      most Java developers completely ignore the overhead of min()/max(),
      because such methods are so common in all languages that a newcomer to
      Java would hardly care to read the docs or the sources (what could be
      simpler than min and max?). But min/max can be critical in some tight
      loops (e.g., DSP). Once again, I think java.lang.StrictMath is okay (it's
      great to have a max() that reports that +0 is bigger than -0, if I ever
      need this); but in java.lang.Math, the spec should enable the trivial
      implementation. The demo below shows exactly where the two semantics
      diverge.
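
      For reference, a small demo of the two special cases; the Math.max()
      results rely only on its documented behavior, while trivialMax is the
      naive one-liner that other languages typically use:

      public class MinMaxDemo {
       // The trivial max found in most languages.
       static double trivialMax(double x, double y) { return x >= y ? x : y; }

       public static void main(String[] args) {
        // Signed zeros: Math.max() must treat +0.0 as greater than -0.0.
        System.out.println(Math.max(-0.0, 0.0));    // 0.0
        System.out.println(trivialMax(-0.0, 0.0));  // -0.0 (-0.0 >= 0.0 is true)

        // NaN: Math.max() must return NaN if either argument is NaN.
        System.out.println(Math.max(Double.NaN, 1.0));    // NaN
        System.out.println(trivialMax(Double.NaN, 1.0));  // 1.0 (NaN >= 1.0 is false)
       }
      }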

      (*) When the FPU can do calculations in extended precision -- e.g., on
      the Pentium, evaluating float operations in 64-bit registers and double
      operations in 80-bit registers -- it's highly likely that the results of
      the transcendental opcodes will be exact at least up to the last bit of
      the Java type (32 bits for float, 64 for double). This means that the
      current precision and semi-monotonicity requirements would certainly be
      satisfied, and only bit-for-bit reproducibility of results might be lost,
      so this optimization could be performed without any spec changes, at
      least on architectures supporting extended-precision FP. Even on
      platforms not supporting FP numbers wider than 64 bits, the optimization
      could be done for floats at the very least, as sketched below.
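
      A minimal sketch of that float-only case, assuming a hypothetical fast
      double-precision primitive (platformTan below is not an existing JDK
      method): evaluate in the wider double format, then round once at the
      end. A tangent accurate to about 1 ulp in double should comfortably stay
      within the 1-ulp bound after the single rounding to float's 24-bit
      significand.

      public final class FloatTan {
       // Hypothetical fast double-precision tangent (e.g., FPU-backed);
       // NOT an existing JDK method.
       private static native double platformTan(double a);

       // Evaluate wide, round once: double carries ~29 extra significand
       // bits over float, absorbing the primitive's error.
       public static float tan(float x) {
        return (float) platformTan((double) x);
       }
      }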

      JUSTIFICATION :
      Many years have gone by since the introduction of StrictMath/strictfp,
      so the people who really care about such wizardry as signed zeros or
      reproducible cross-platform results are either using StrictMath
      explicitly, or deserve any problems they might have with a relaxation of
      the loose methods. On the other hand, the people who don't need those
      advanced features but need maximum FP performance currently have no
      choice (except for custom math libs, and most would require custom JNI
      code). If there is a risk that relaxing java.lang.Math could break some
      apps (which *shouldn't* be using this class in the first place), that is
      certainly a lesser problem than the severe hit currently imposed on those
      who want to write scientific and engineering apps, or advanced 3D games,
      in Java.



      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      Performance should be similar to C/C++; for example, benchmarking tan()
      shows a cost of 12 ns per call (VC++ 2003, Pentium IV 2.25 GHz).
      ACTUAL -
      The Java version of this benchmark shows 148 ns per call to tan()
      (whether via java.lang.Math or java.lang.StrictMath). This is 12x slower
      (HotSpot 1.4.2 -server -Xbatch). The same problem happens with several
      methods in java.lang.Math.

      ---------- BEGIN SOURCE ----------
      Java benchmark:

      public class Tan {
       static double sink; // consume results so the JIT cannot eliminate the calls

       static void benchLoose () {
        long t = System.currentTimeMillis();
        double s = 0;
        for (double i = 0; i < 10000000.0; ++i) s += Math.tan(i);
        sink = s;
        // elapsed ms * 1e6 ns/ms / 1e7 calls = elapsed/10 ns per call
        System.out.println("Loose: "+(System.currentTimeMillis()-t)/10+"ns/call");
       }
       static strictfp void benchStrict () {
        long t = System.currentTimeMillis();
        double s = 0;
        for (double i = 0; i < 10000000.0; ++i) s += StrictMath.tan(i);
        sink = s;
        System.out.println("Strict: "+(System.currentTimeMillis()-t)/10+"ns/call");
       }
       public static void main (String[] args) {
        while (true) {
         benchLoose();
         benchStrict();
        }
       }
      }

      C++ benchmark:

      #include <cmath>
      #include <cstdio>
      #include <ctime>

      double sink; // consume results so the optimizer cannot eliminate the calls

      void bench () {
       clock_t t = clock();
       double s = 0;
       for (double i = 0; i < 10000000.0; ++i) s += tan(i);
       sink = s;
       // elapsed ticks -> seconds via CLOCKS_PER_SEC; * 1e9 ns / 1e7 calls
       printf("C++: %fns/call\n", (clock() - t) * 1e9 / CLOCKS_PER_SEC / 10000000.0);
      }
      int main () {
       while (true)
        bench();
      }

      Note: even though both are microbenchmarks and I tested with full
      optimization, I have analysed the generated code to be sure that tan()
      is actually being called.
      ---------- END SOURCE ----------
      (Incident Review ID: 234064)
      ======================================================================

      Assignee: Joe Darcy (darcy)
      Reporter: Ranjith Mandala (rmandalasunw) (Inactive)