JDK-8285871

Math.multiplyHigh and multiply on same inputs can be computed faster if their computation is shared

Details

    Description

      Occasionally a hardware instruction produces two results, both of which are used by Java code, but the JIT may execute the same instruction twice (on the same inputs), once for each result. This is true on x86 for both multiply and divide instructions.

      Under the flag UseDivMod we perform an extra pass to collect both results of divide instructions, if the IR in fact requests them.
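As an illustration of the div/mod case, the following sketch shows a Java source shape that requests both results of one division. The class and method names are hypothetical, and whether the JIT actually merges the two operations depends on the platform and flag settings.

```java
// Hypothetical example class; names are illustrative only.
class DivModPair {
    // Returns {quotient, remainder}. Both operators act on the same inputs,
    // which is the kind of pairing a div/mod-combining pass can look for.
    static long[] divMod(long a, long b) {
        long q = a / b;   // quotient, truncated toward zero
        long r = a % b;   // remainder of the same division
        return new long[] { q, r };
    }

    public static void main(String[] args) {
        long[] qr = divMod(17, 5);
        System.out.println("q=" + qr[0] + " r=" + qr[1]);
    }
}
```

On x86 a single divide instruction yields both values, so nothing in this source requires two divisions.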

      Here's the RFE: We should do the same for x86 multiply instructions. The hardware routinely produces a full 128-bit signed or unsigned product, only half of which is delivered. The same pass that looks for div/mod pairs should also look for multiply/multiply-high pairs (both signed and unsigned) and schedule just one multiply to produce both answers.
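The multiply case can be sketched in Java as follows. `Math.multiplyHigh` is the existing JDK method for the signed high half; the class name, the helper methods, and the BigInteger cross-check are assumptions added for illustration. The unsigned high half is derived here with a well-known correction identity so the sketch does not depend on `Math.unsignedMultiplyHigh` (available from Java 18).

```java
import java.math.BigInteger;

// Hypothetical example class; names are illustrative only.
class MulPair {
    // Low 64 bits of the product (identical for signed and unsigned inputs).
    static long low(long a, long b) {
        return a * b;
    }

    // High 64 bits of the signed 128-bit product.
    static long signedHigh(long a, long b) {
        return Math.multiplyHigh(a, b);
    }

    // High 64 bits of the unsigned 128-bit product, obtained by correcting
    // the signed high half for inputs whose top bit is set.
    static long unsignedHigh(long a, long b) {
        return Math.multiplyHigh(a, b) + ((a >> 63) & b) + ((b >> 63) & a);
    }

    // Cross-check: hi * 2^64 + unsigned(lo) must equal the exact signed product.
    static boolean matchesReference(long a, long b) {
        BigInteger lo = new BigInteger(Long.toUnsignedString(low(a, b)));
        BigInteger combined = BigInteger.valueOf(signedHigh(a, b)).shiftLeft(64).add(lo);
        return combined.equals(BigInteger.valueOf(a).multiply(BigInteger.valueOf(b)));
    }

    public static void main(String[] args) {
        long a = Long.MAX_VALUE, b = -12345L;
        System.out.println(matchesReference(a, b));
    }
}
```

A caller that uses both `low(a, b)` and `signedHigh(a, b)` on the same inputs is the pattern this RFE asks the JIT to serve with a single x86 multiply.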

      For the record, ARM64 has separate instructions MUL and SMULH/UMULH so this is an Intel-specific RFE.

      Suggestion: This is probably not the last time that a hardware instruction produces 128 bits of result. The current UseDivMod logic should be extended, not just to capture MUL/MULH pairs, but also to be ready to capture the next such pair. Likely future candidates are carryless multiply (produces 127 bits of output) and AES steps (produce 128 bits of output). It would be entirely reasonable for JDK API points for these functions to come in pairs, one to produce the high half and one to produce the low half, with the expectation that the JIT would avoid duplicate work by "sewing together" the pairs in the generated code.
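To make the "API points in pairs" idea concrete, here is a pure-Java sketch of a carryless multiply split into a low-half and a high-half method. No such JDK methods are described in this issue; the class and method names are hypothetical, and the bit-by-bit loops merely stand in for what a single hardware instruction would compute.

```java
// Hypothetical example class; these are not existing JDK methods.
class ClmulPair {
    // Low 64 bits of the carryless (XOR-accumulated) product of a and b.
    static long clmulLow(long a, long b) {
        long lo = 0;
        for (int i = 0; i < 64; i++) {
            if (((b >>> i) & 1L) != 0) {
                lo ^= a << i;
            }
        }
        return lo;
    }

    // High bits of the same product. The full result has at most 127 bits,
    // so bit 63 of this value is always zero.
    static long clmulHigh(long a, long b) {
        long hi = 0;
        // Start at i = 1: bit 0 of b contributes nothing to the high half,
        // and a >>> 64 would not be zero in Java because shift counts wrap.
        for (int i = 1; i < 64; i++) {
            if (((b >>> i) & 1L) != 0) {
                hi ^= a >>> (64 - i);
            }
        }
        return hi;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(clmulLow(3L, 3L)));
        System.out.println(Long.toHexString(clmulHigh(-1L, -1L)));
    }
}
```

Both methods walk the same inputs, so intrinsic versions of such a pair would be natural candidates for being merged into one instruction.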

      (This tactic would also work for 128-bit non-atomic loads, which may be important later. It doesn't work as well for 128-bit atomic loads, sadly.)

      Of course providing such methods in pairs is inferior to having them produce single results in tuple format, such as a `new long[2]` (with vigorous escape analysis) or some Valhalla value type like `Long128`. But it is quite likely that the easiest way to engineer such tuple-returning methods is to implement them with JDK-private primitives that are organized as method pairs, and are routinely sewn together by the JIT.
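A tuple-returning method layered on a method pair might look like the sketch below. The class name and `multiply128` are hypothetical; the two calls inside it stand in for JDK-private primitives, with `Math.multiplyHigh` and an ordinary multiply used as the pair. Whether the array allocation is removed depends on escape analysis and is not guaranteed.

```java
// Hypothetical example class; multiply128 is not an existing JDK method.
class WideMultiply {
    // Returns the signed 128-bit product as {high, low}.
    static long[] multiply128(long a, long b) {
        long hi = Math.multiplyHigh(a, b); // high half of the pair
        long lo = a * b;                   // low half of the pair
        return new long[] { hi, lo };
    }

    public static void main(String[] args) {
        long[] p = multiply128(1L << 62, 4L);
        System.out.println("hi=" + p[0] + " lo=" + p[1]);
    }
}
```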

      In the JIT, it doesn't matter whether an intrinsic method is JDK-private or public. Bottom line: We should generalize the UseDivMod hack into something we can readily replicate in the future as new 128-bit intrinsics appear.

            People

              Assignee: Unassigned
              Reporter: John Rose (jrose)