add architecture for 128-bit intrinsics

    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: P4
    • Affects Version/s: None
    • Component/s: hotspot

      ## on quad-returning intrinsics

      Intrinsics are useful whenever there is an assembly language idiom that performs an important bit of work that C2 doesn't know how to build in IR (and shouldn't be taught, if it wouldn't be a gainful optimization).

      And, yes, we want 128-bit intrinsics. Here are some 128-bit operations we want:

       - Intel mulx (long mul+mulhi)
       - long div/mul pairs (already covered?)
       - quad integer arithmetic (various ops)
       - transcendental function pairs (sincos)
       - CLMUL (all modern platforms do this at speed)
       - AES steps (same comment about modern platforms)
       - other crypto ops (maybe, but not for actual crypto algorithms)

      Many such operations cannot be readily expressed in Java code, which maxes out at `long` primitive values. And "let's wait for Valhalla" doesn't fix the problem; it just moves each problem into the heart of some value class. In fact, Valhalla increases demand for a design for 128-bit intrinsics.
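      As a concrete illustration of the gap: Java can already compute a full 64x64-to-128 unsigned product today (`Math.unsignedMultiplyHigh` has been real API since JDK 18), but since the pair cannot be returned as one value, the two halves must be smuggled out separately, e.g. in an array:

      ```
      public class Mul128Demo {
          // 64x64 -> 128-bit unsigned multiply, reassembled from two 64-bit results.
          // The low 64 bits of the product are the same for signed and unsigned
          // multiplication, so plain `*` supplies them.
          static long[] umul128(long x, long y) {
              long hi = Math.unsignedMultiplyHigh(x, y);
              long lo = x * y;
              return new long[] { lo, hi };  // the array is the return-value workaround
          }

          public static void main(String[] args) {
              long[] p = umul128(-1L, -1L);  // (2^64-1)^2 = 2^128 - 2^65 + 1
              // prints 18446744073709551614:1
              System.out.println(Long.toUnsignedString(p[1]) + ":" + Long.toUnsignedString(p[0]));
          }
      }
      ```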

      ## the problem of returns

      A key blocker here is that we have no good architecture for dealing with 128-bit data items, _unless_ they are also stored in vector registers, _and_ on a platform that supports those. Both suppositions are faulty; we just want 128-bit data in general registers; then we can discuss whether they also go in vectors.

      The biggest missing piece is that there is no way to express, in Java code, the reception of a 128-bit value. We can pass such values in as 64-bit value pairs; we just can't return them, or receive them back from a call.

      We don't want a one-shot ad hoc solution for this, just for (say) some particular multiply operation. We really need a design that scales to multiple use cases, like those listed above.

      The problem is briefly discussed in https://bugs.openjdk.org/browse/JDK-8285871

      For embedding intrinsic calls in Java code, there are two basic options:

      \1. 2-Arrays: The intrinsic method returns its value into a `long[]` array at a specified index, and also the following index. (This is useful in various ways.) We make sure the one particular use case always optimizes away: The output goes to a locally allocated, non-escaping array of length 2. (The intrinsic does NOT allocate the array; this is an important division of labor.)

      \2. Method-pairs: Each method returns 64 bits; the two methods take identical arguments. Ask C2 to pattern-match pairs of calls into an intrinsic node. Expensive to manage the nonstandard intrinsic nodes (with projections), but maybe the shortest route, if the first route is blocked.

      In both cases, the IR has to boil down to a call to an intrinsic which returns a pair of 64-bit values. This means we also need a register convention, at least so we can unambiguously name the two output registers, whether they are fixed or allocatable in the IR.
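      The 2-array convention can be exercised in plain Java today. The following sketch assumes a hypothetical `mulx` intrinsic candidate; the method body shown is only the interpreter/fallback implementation (C2 would replace the call with a two-output node, and the non-escaping length-2 array would scalarize away):

      ```
      public class TwoArrayDemo {
          // Hypothetical intrinsic candidate: writes the 128-bit unsigned product
          // of x*y into out[i] ("first") and out[i+1] ("second"). This Java body
          // is only the fallback; the JIT would intrinsify calls to it.
          static void mulx(long x, long y, long[] out, int i) {
              out[i]     = x * y;                           // first
              out[i + 1] = Math.unsignedMultiplyHigh(x, y); // second
          }

          static long mulHi(long x, long y) {
              var buf = new long[2];  // locally allocated, non-escaping
              mulx(x, y, buf, 0);
              return buf[1];
          }

          public static void main(String[] args) {
              System.out.println(mulHi(1L << 32, 1L << 32));  // 2^64: prints 1
          }
      }
      ```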

      _A word to the wise:_ At the lowest level, names like "high" and "low" are dangerous, because they presume arithmetic usage and endianness, which are not always present. The names "first" and "second" are much more robust, if they are coupled to whatever memory storage conventions apply to the datum in question. (I.e., not all outputs have a natural high/low structure, but they can always be given a memory order.) Thus, if we use 2-arrays, it's obvious that `a[i]` takes the "first" and `a[i+1]` takes the "second" output, whereas a high/low rule would leave the big-endian behavior anybody's guess. If we use method-pairs, one method must be designated as "first". The IR projections must be consistently numbered as well, obviously. Just... stay away from endian assumptions.
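      One way to make "first/second" concrete without appealing to endianness is to tie the names to memory offsets. A minimal sketch: "first" is the long at the lower address, "second" at the higher, and only the bytes _within_ each long follow the native order:

      ```
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;

      public class MemoryOrderDemo {
          // Store a 128-bit datum as two longs in memory order: "first" at the
          // lower address, "second" at the higher. This placement rule reads the
          // same on little- and big-endian hosts.
          static byte[] store(long first, long second) {
              ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.nativeOrder());
              buf.putLong(0, first);
              buf.putLong(8, second);
              return buf.array();
          }

          static long loadFirst(byte[] mem) {
              return ByteBuffer.wrap(mem).order(ByteOrder.nativeOrder()).getLong(0);
          }
      }
      ```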

      We will quickly find uses for out-of-line calling conventions, starting with intrinsics which need to call a stub instead of expanding to assembly code. This leads to a third convention, which co-exists with either (or both) of the previous ones:

      \3. Return registers: We need to fix (per platform) a standard calling sequence for returning pairs of 64-bit values. We should just pick a C type (`uint128_t` or `struct{uint64_t f,s;}`). Its behavior will depend on a platform-specific C ABI, which we should conform to if possible. This is what we already do for the Java primitive types. The interpreter doesn't need to know this convention, but the compilers do.
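      For the out-of-line case, note that the FFM API (`java.lang.foreign`, finalized in JDK 22) already models this ABI question: a downcall described with a two-long struct layout returns the struct by value according to the platform C convention (e.g. the RAX:RDX pair on x86-64 System V). A sketch, where the `mulx` stub signature is hypothetical:

      ```
      import java.lang.foreign.FunctionDescriptor;
      import java.lang.foreign.MemoryLayout;
      import java.lang.foreign.StructLayout;
      import java.lang.foreign.ValueLayout;

      public class AbiDemo {
          // Layout of struct{uint64_t f,s;}: the C ABI for returning this by
          // value is decided per platform by the native linker.
          static final StructLayout PAIR = MemoryLayout.structLayout(
                  ValueLayout.JAVA_LONG.withName("f"),
                  ValueLayout.JAVA_LONG.withName("s"));

          // Hypothetical descriptor for a mulx-style stub: (long, long) -> PAIR.
          static final FunctionDescriptor MULX_DESC =
                  FunctionDescriptor.of(PAIR, ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG);

          public static void main(String[] args) {
              System.out.println(PAIR.byteSize());  // prints 16
          }
      }
      ```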

      The out-of-line calling conventions may be useful for C1 code generation as well. The interpreter will just allocate the array and/or call both methods in the pair. The C2 IR, as noted above, is likely to register-allocate the output registers and inline the assembly code, if it exists. Otherwise it too would probably call a stub, with the same out-of-line calling convention.

      After Valhalla, we might teach the interpreter to accept some 2-word value types to return directly as register pairs; or maybe not. The JIT output is what matters, and it matters both before and after Valhalla.

      ## 128-bit intrinsic examples

      With conventions like the above (either 1 or 2), we can compile code like the following to a single `mulx` instruction on x86, and similar short sequences on other platforms:

      ```
      value record LongPair(long fst, long snd) {
        // (Avoiding Int128, where maybe we'd have hi/lo instead of fst/snd.)

        static LongPair twoArrayExample(long x, long y) {
          var buf = new long[2]; // buf will be EA-ed and scalarized
          jdk.internal.math.MathUtils.mulx(x, y, buf, 0);
          // the call is replaced by a two-output intrinsic
          return new LongPair(buf[0], buf[1]);
        }

        static LongPair methodPairExample(long x, long y) {
          long hi = Math.multiplyHigh(x, y);  // (var cannot declare two locals at once)
          long lo = x * y;
          // the previous two ops are fused at the IR level into a new node
          return new LongPair(lo, hi); // conventional LE ordering
        }
      }
      ```

      ## other uses of 128-bit data

      For Java code, and for most JIT IR, it is acceptable to work with 128-bit values as pairs of 64-bit values. (They are differentiated according to some convention - hi/lo for numerics, or fst/snd for others, as the case may be.)

      For Valhalla, such register pairs can easily be "wired up" to value-class instances, since they too are just loose bundles of field values in registers.

      For vector processing, the register pairs might be reorganized as vectors, or split out from vector values. If this is done, care should be taken to minimize the reorganization steps, since they are expensive on all major platforms. In special cases, it might even be useful to write some intrinsics twice, once for general registers, and once for vectors. In the latter case, there might also be a version which operates in vector registers, SIMD-style, on _several_ 128-bit values at once. This could be prompted from the autovectorizer or the Vector API.

      There is an obvious gap in the Vector API for 128-bit lane types. This can be filled in various ways, including operations on lane _pairs_. This corresponds to option 2 above. Hypothetically:

      ```
      final var SHUFFLE_IN_FST = VectorShuffle.fromOp(x.species(), i -> i/2);
      // Take the first half of each input; replicate in striped form.
      LongVector xxin = x.rearrange(SHUFFLE_IN_FST);
      LongVector yyin = y.rearrange(SHUFFLE_IN_FST);
      // Output 64-bit vector is striped with 128-bit ints.
      LongVector xyfst = xxin.lanePairWise(MUL, MUL_HI, yyin);
      ```

      (Affirmation: This is solely my own work, without AI help. I happen to like markdown.)

            Assignee:
            Unassigned
            Reporter:
            John Rose