Performance regression for float/double modulo operation

      ADDITIONAL SYSTEM INFORMATION :
      Linux, any vendor, any OpenJDK (tested 8 and 21=trunk)

      A DESCRIPTION OF THE PROBLEM :
      The performance regression depends on the GCC version used to build the JDK. It regressed between gcc-4.8 and gcc-4.9:
      [PATCH, i386]: Enable reminder{sd,df,xf} and fmod{sf,df,xf} only for flag_finite_math_only.
       = https://gcc.gnu.org/pipermail/gcc-patches/2014-September/400104.html
      https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=93ba85fdd253b4b9cf2b9e54e8e5969b1a3db098
      Java modulo (%) is compiled into the Java bytecode drem, which is defined like C fmod(), not like C drem() (also known as remainder()). So for the C/C++ function fmod():
       * gcc-4.8 used the fast CPU instruction fprem and fell back to the glibc function fmod() only if the result was non-finite.
       * gcc-4.9 started using the fast CPU instruction fprem only with -ffinite-math-only (which is also part of the more common -ffast-math). -ffinite-math-only has other effects on the code (such as isinf() no longer working), so this optimization is not really usable.
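      The difference between the two C functions mentioned above can be seen from Java itself. The following is only a minimal sketch: the class and method names are made up for illustration, and Math.IEEEremainder is used here as the Java counterpart of C remainder()/drem().

```java
public class RemainderKinds {
    // Java's % on doubles (bytecode drem): quotient truncated toward zero, like C fmod().
    static double truncatingRem(double a, double b) {
        return a % b;
    }

    // IEEE 754 remainder: quotient rounded to nearest, like C remainder()/drem().
    static double ieeeRem(double a, double b) {
        return Math.IEEEremainder(a, b);
    }

    public static void main(String[] args) {
        System.out.println(truncatingRem(5.5, 2.0)); // 1.5  (5.5 - 2 * 2.0)
        System.out.println(ieeeRem(5.5, 2.0));       // -0.5 (5.5 - 3 * 2.0)
    }
}
```

      Only the first form is affected by this report; Math.IEEEremainder takes a different code path.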
      According to the following references, the behavior of the Java bytecode drem matches the CPU instruction fprem, so OpenJDK can use the instruction directly:
       * https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-6.html#jvms-6.5.drem
       * https://community.intel.com/legacyfs/online/drupal_files/managed/a4/60/325383-sdm-vol-2abcd.pdf#page=483
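      For reference, these are the drem special cases that any native implementation has to reproduce, as I read the Java specification. This is a sketch with a made-up class name; it demonstrates the Java-level semantics only and does not by itself prove anything about fprem.

```java
public class RemSpecialCases {
    // Plain Java remainder on doubles; compiles to the drem bytecode.
    static double rem(double a, double b) {
        return a % b;
    }

    public static void main(String[] args) {
        // The sign of the result follows the dividend, not the divisor.
        System.out.println(rem(-5.5, 2.0));                     // -1.5
        System.out.println(rem(5.5, -2.0));                     // 1.5
        // Infinite dividend, zero divisor, or NaN operand gives NaN.
        System.out.println(rem(Double.POSITIVE_INFINITY, 2.0)); // NaN
        System.out.println(rem(5.5, 0.0));                      // NaN
        System.out.println(rem(Double.NaN, 2.0));               // NaN
        // Finite dividend with an infinite divisor returns the dividend.
        System.out.println(rem(5.5, Double.POSITIVE_INFINITY)); // 5.5
    }
}
```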
      The following 3 issues would be worth fixing in the upstream Linux components, but fixing them is not required for OpenJDK/Zulu:
       * The glibc implementation of fmod() does not use the fprem instruction. I do not really understand why; I consider it a missed optimization.
       * gcc could also use the fprem instruction instead of calling glibc fmod(). Even gcc-4.8 had the fmod() fallback for non-finite numbers, and I do not understand why it was there.
       * clang does not have any fprem instruction optimization; it only calls glibc fmod().
      The patch fixes the performance and applies to both OpenJDK-8 and OpenJDK trunk (and, I expect, anything in between). I see no regression on OpenJDK-8 Linux x86_64.
      I did not test whether Oracle Java 8 was faster or not; that depends on which compiler Oracle was using.
      It has regressed for example from:
        CentOS-7.1
        java-1.8.0-openjdk-1.8.0.31-2.b13.el7.x86_64
        GNU C 4.8.3 20140911 (Red Hat 4.8.3-9) -mtune=generic -march=x86-64 -g -O3 -fno-omit-frame-pointer -fstack-protector-strong -fno-strict-aliasing -fPIC
        gcc-4.8.3-9.el7.src.rpm does not yet contain the problematic patch
      to:
        CentOS-7.9
        java-1.8.0-openjdk-1.8.0.362.b08-1.el7_9.x86_64
        GNU C++ 4.8.5 20150623 (Red Hat 4.8.5-44) -m64 -mtune=generic -march=x86-64 -g -g -O3 -std=gnu++98 -fPIC -fno-rtti -fno-exceptions -fcheck-new -fvisibility=hidden -fno-strict-aliasing -fno-omit-frame-pointer -fstack-protector -fstack-protector-strong -fpch-deps --param ssp-buffer-size=4
        gcc-4.8.5-44.el7 already contains the problematic patch


      REGRESSION : Last worked in version 8

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      wget https://jankratochvil.net/t/DivisionDemo.java https://jankratochvil.net/t/benchmark.sh
      # edit old= and new= in benchmark.sh
      bash benchmark.sh


      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      JVM version: 1.8.0_31
      Iteration 0 regression case Took : 92 noMod case took: 63 noPower case took: 70
      Iteration 1 regression case Took : 89 noMod case took: 63 noPower case took: 69
      Iteration 2 regression case Took : 62 noMod case took: 63 noPower case took: 70
      Iteration 3 regression case Took : 62 noMod case took: 63 noPower case took: 70
      Iteration 4 regression case Took : 62 noMod case took: 63 noPower case took: 70
      Iteration 5 regression case Took : 65 noMod case took: 63 noPower case took: 70
      Iteration 6 regression case Took : 63 noMod case took: 63 noPower case took: 69
      Iteration 7 regression case Took : 63 noMod case took: 63 noPower case took: 69
      Iteration 8 regression case Took : 62 noMod case took: 64 noPower case took: 69
      Iteration 9 regression case Took : 62 noMod case took: 64 noPower case took: 69
       - after the first two warm-up iterations, each line contains about the same 3 numbers

      ACTUAL -
      JVM version: 1.8.0_362
      Iteration 0 regression case Took : 472 noMod case took: 63 noPower case took: 98
      Iteration 1 regression case Took : 465 noMod case took: 63 noPower case took: 96
      Iteration 2 regression case Took : 462 noMod case took: 42 noPower case took: 95
      Iteration 3 regression case Took : 458 noMod case took: 38 noPower case took: 106
      Iteration 4 regression case Took : 470 noMod case took: 63 noPower case took: 96
      Iteration 5 regression case Took : 465 noMod case took: 63 noPower case took: 102
      Iteration 6 regression case Took : 465 noMod case took: 63 noPower case took: 96
      Iteration 7 regression case Took : 465 noMod case took: 63 noPower case took: 97
      Iteration 8 regression case Took : 465 noMod case took: 63 noPower case took: 96
      Iteration 9 regression case Took : 457 noMod case took: 39 noPower case took: 85
       - the first test (the modulo "regression case") is about 7x slower than expected


      ---------- BEGIN SOURCE ----------
      https://jankratochvil.net/t/DivisionDemo.java
      https://jankratochvil.net/t/benchmark.sh
      This reproducer was not written by me.

      ---------- END SOURCE ----------
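      For readers who cannot fetch the files above, a timing loop of the following shape exercises the same operation. This is a hypothetical sketch and not the linked DivisionDemo.java: the class name, the workload size, and the divisor are all assumptions chosen for illustration.

```java
public class ModuloTiming {
    // Sum of (i % divisor) for i in [0, n); returning the sum keeps the JIT
    // from discarding the loop as dead code.
    static double sumOfRemainders(int n, double divisor) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (double) i % divisor;
        }
        return sum;
    }

    public static void main(String[] args) {
        final int n = 10_000_000;   // assumed workload size
        final double divisor = 3.7; // assumed divisor
        for (int iter = 0; iter < 10; iter++) {
            long start = System.nanoTime();
            double sum = sumOfRemainders(n, divisor);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Iteration " + iter + " modulo loop took: "
                    + elapsedMs + " ms (checksum " + sum + ")");
        }
    }
}
```

      Running the same class on two JDK builds that differ only in the compiler used to build them should show whether the modulo loop is affected; absolute numbers will differ from the ones reported above.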

      CUSTOMER SUBMITTED WORKAROUND :
      https://jankratochvil.net/t/openjdk-asm.patch
      It could also be fixed either in GCC or in glibc (or both).


      FREQUENCY : always


            Unassigned Unassigned
            webbuggrp Webbug Group