Updated JEP 338: Vector API to reflect API changes


      Summary
      -------

      Provide an initial iteration of an [incubator module], `jdk.incubator.vector`, to express
      vector computations that reliably compile at runtime to optimal vector hardware instructions on
      supported CPU architectures and thus achieve superior performance to equivalent scalar
      computations.

      [incubator module]: http://openjdk.java.net/jeps/11


      Goals
      -----

      - *Clear and concise API:*
      The API shall be capable of clearly and concisely expressing a wide range of
      vector computations consisting of a sequence of vector operations often composed
      within loops and possibly with control flow. It should be possible to express a
      computation that is generic to vector size (or the number of lanes per vector)
      thus enabling such computations to be portable across hardware supporting
      different vector sizes.

      - *Reliable runtime compilation and performance on x64 architectures:*
      The Java runtime, specifically the HotSpot C2 compiler, shall compile, on
      capable x64 architectures, a sequence of vector operations to a corresponding
      sequence of vector hardware instructions, such as those supported by
      [Streaming SIMD Extensions][SSE] (SSE) and [Advanced Vector Extensions][AVX]
      (AVX), thereby generating efficient and performant code.
      The programmer shall have confidence that the vector operations they express
      will reliably map closely to associated hardware vector instructions.

      [SSE]:https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
      [AVX]:https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

      - *Platform agnostic:*
      The API shall be architecture agnostic, enabling runtime implementations
      on multiple CPU architectures that support vector hardware instructions.
      As is usual in Java APIs, where platform optimization and portability
      conflict, the bias will be to making the Vector API portable, even if some
      platform-specific idioms cannot be directly expressed in portable code.
      The previous goal of x64 performance is representative of appropriate
      performance goals on all platforms where Java is supported.
      The [ARM Scalable Vector Extension][SVE] (SVE) is of special interest in this
      regard to ensure the API can support this architecture (even though as of
      writing there are no known production hardware implementations).

      [SVE]:https://arxiv.org/pdf/1803.06185.pdf

      - *Graceful degradation:*
      If a vector computation cannot be fully expressed at runtime as a sequence of
      hardware vector instructions, either because an x64 architecture does not
      support some of the required instructions or because another CPU architecture is
      not supported, then the Vector API implementation shall degrade gracefully and
      still function. This may include issuing warnings to the developer if a vector
      computation cannot be sufficiently compiled to vector hardware instructions.
      On platforms without vectors, graceful degradation shall yield code
      competitive with manually-unrolled loops, where the unroll factor is
      the number of lanes in the selected vector.


      Non-Goals
      ---------

      - It is not a goal to enhance the auto-vectorization support in HotSpot.

      - It is not a goal for HotSpot to support vector hardware instructions on CPU
      architectures other than x64. Such support is left for later JEPs.
      However, it is important to state, as expressed
      in the goals, that the API must not rule out such implementations. Further,
      work performed may naturally leverage and extend existing abstractions in
      HotSpot for auto-vectorization support, making such a task easier.

      - It is not a goal to support the C1 compiler in this or future iterations. We
      expect the Graal compiler to be supported in future work.

      - It is not a goal to support strict floating point calculations as defined by
      the Java `strictfp` keyword. The results of floating point operations performed
      on floating point scalars may differ from equivalent floating point operations
      performed on vectors of floating point scalars. However, this non-goal does not
      rule out options to express or control the desired precision or reproducibility
      of floating point vector computations.


      Motivation
      ----------

      Vector computations consist of a sequence of operations on vectors. A vector
      comprises a (usually) fixed sequence of scalar values, where each scalar value
      corresponds to a hardware-defined vector lane. A binary operation applied
      to two vectors with the same number of lanes would, for each lane, apply the
      equivalent scalar operation on the corresponding two scalar values from each
      vector. This is commonly referred to as
      [Single Instruction Multiple Data][SIMD] (SIMD).

      [SIMD]:https://en.wikipedia.org/wiki/SIMD

      Vector operations express a degree of parallelism that enables more work to be
      performed in a single CPU cycle and thus can result in significant performance
      gains. For example, given two vectors each covering a sequence of eight
      integers (eight lanes), then the two vectors can be added together using a
      single hardware instruction. The vector addition hardware instruction operates
      on sixteen integers, performing eight integer additions, in the time it would
      ordinarily take to operate on two integers, performing one integer addition.
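
      For illustration, using the API described later in this document (the `IntVector`
      class and its `fromArray`, `add`, and `intoArray` methods), such an eight-lane
      addition could be written as follows. This is only a sketch; the species constant
      `SPECIES_256` is an assumption chosen so that eight 32-bit lanes fill the vector.

          static final VectorSpecies<Integer> I256 = IntVector.SPECIES_256; // eight 32-bit int lanes

          // Adds the first eight elements of x and y lane-wise and stores the result in z.
          // A single vector addition replaces eight scalar additions.
          void addEightLanes(int[] x, int[] y, int[] z) {
              var vx = IntVector.fromArray(I256, x, 0); // load lanes x[0..7]
              var vy = IntVector.fromArray(I256, y, 0); // load lanes y[0..7]
              vx.add(vy).intoArray(z, 0);               // one lane-wise add, then store
          }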

      HotSpot supports [auto-vectorization] where scalar operations are transformed into
      superword operations, which are then mapped to vector hardware instructions.
      The set of transformable scalar operations is limited and fragile with respect
      to changes in the code shape. Furthermore, only a subset of available vector
      hardware instructions might be utilized, limiting the performance of generated code.

      [auto-vectorization]:http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf

      A developer wishing to write scalar operations that are reliably transformed
      into superword operations needs to understand HotSpot's auto-vectorization
      support and its limitations to achieve reliable and sustainable performance.

      In some cases it may not be possible for the developer to write scalar
      operations that are transformable. For example, HotSpot does not transform the
      simple scalar operations for calculating the hash code of an array (see the
      `Arrays.hashCode` method implementations in the JDK source code), nor can it
      auto-vectorize code to lexicographically compare two arrays (which is why an
      intrinsic was added to perform lexicographical comparison, see
      [JDK-8033148][JDK-8033148]).

      [JDK-8033148]:https://bugs.openjdk.java.net/browse/JDK-8033148

      The Vector API aims to address these issues by providing a mechanism to write
      complex vector algorithms in Java, using pre-existing support in HotSpot
      for vectorization, but with a user model which makes vectorization far more
      predictable and robust. Hand-coded vector loops can express high-performance
      algorithms (such as vectorized `hashCode` or specialized array comparison)
      which an auto-vectorizer may never optimize.
      There are numerous domains where this explicitly vectorizing
      API may be applicable such as machine learning, linear algebra, cryptography,
      finance, and usages within the JDK itself.


      Description
      -----------

      A vector will be represented by the abstract class `Vector<E>`.
      The type variable `E` corresponds to the boxed type of scalar primitive integral
      or floating point element types covered by the vector. A vector also has a shape,
      which defines the size, in bits, of the vector. The shape of the vector will govern
      how an instance of `Vector<E>` is mapped to a vector hardware register when vector
      computations are compiled by the HotSpot C2 compiler (see later for a mapping from
      instances to x64 vector registers). The length of a vector (number of lanes or elements)
      will be the vector size divided by the element size.

      The set of element types (`E`) supported will be `Byte`, `Short`, `Integer`, `Long`,
      `Float` and `Double` corresponding to the scalar primitive types `byte`,
      `short`, `int`, `long`, `float` and `double`, respectively.

      The set of shapes supported will correspond to vector sizes of 64, 128, 256, and 512 bits.
      A shape corresponding to a size of 512 bits can pack `byte`s into 64 lanes or pack
      `int`s into 16 lanes, and a vector of such a shape can operate on 64 `byte`s at
      a time, or 16 `int`s at a time.

      (_Note:_ We believe that these simple shapes are generic enough to be
      useful on all platforms supporting the Vector API. However, as we
      experiment during the incubation of this JEP with future platforms, we may further
      modify the design of the shape parameter. Such work is not in
      the early scope of this JEP, but these possibilities partly inform the
      present role of shapes in the Vector API. See the "Future Work" section.)

      The combination of element type and shape determines the vector's species, represented by `VectorSpecies<E>`.
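
      For example (a sketch; the species constants below follow the naming used in the
      examples later in this document, and the 512-bit `int` constant is an assumption
      by analogy):

          VectorSpecies<Float>   fs = FloatVector.SPECIES_256; // float elements, 256-bit shape
          VectorSpecies<Integer> is = IntVector.SPECIES_512;   // int elements, 512-bit shape

          // number of lanes = vector size in bits / element size in bits
          int floatLanes = fs.length(); // 256 / 32 = 8 lanes
          int intLanes   = is.length(); // 512 / 32 = 16 lanes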

      An instance of `Vector<E>` is immutable and is a value-based type that
      retains, by default, object identity invariants (see later for relaxation of
      these invariants).

      `Vector<E>` declares a set of methods for common vector operations supported
      by all element types. Such operations can be classified into groups of unary
      (e.g. negation), binary (e.g. addition), and comparison (e.g. lessThan), with
      corresponding familiar scalar operational equivalents. Further operations are
      more specific to vectors, such as cross-lane operations (e.g. permuting elements),
      converting to a vector of a different type and/or shape (e.g. casting), or
      storing the vector elements into a data container (e.g. a `byte[]` array).

      To support operations specific to an element type there are six
      abstract sub-classes of `Vector<E>`, one for each supported element type,
      `ByteVector`, `ShortVector`, `IntVector`, `LongVector`,
      `FloatVector`, and `DoubleVector`. There are operations that are specific
      to the integral sub-types, such as bitwise operations (e.g. logical xor),
      and operations specific to the floating point types, such as mathematical
      operations (e.g. transcendental functions like cosine). Other operations are
      bound to the element type since the method signature refers to the element type
      (or the equivalent array type), such as reduction operations (e.g. sum all
      elements to a scalar value), or storing the vector elements to an array.
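
      As a rough sketch of such type-bound operations, the calls below show the intended
      shape of the API; the method names `xor` and `addLanes` are illustrative
      placeholders rather than confirmed API points:

          int[]   ints   = new int[IntVector.SPECIES_256.length()];
          float[] floats = new float[FloatVector.SPECIES_256.length()];

          var vi = IntVector.fromArray(IntVector.SPECIES_256, ints, 0);     // integral vector
          var vf = FloatVector.fromArray(FloatVector.SPECIES_256, floats, 0);

          IntVector vx = vi.xor(vi);   // bitwise operation, defined only on integral sub-types (placeholder name)
          int sum = vi.addLanes();     // reduction of all lanes to an int scalar (placeholder name)
          vf.intoArray(floats, 0);     // store bound to the element's array type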

      These classes are further extended by concrete sub-classes defined for the different shapes (sizes) of vectors.

      The following table presents the concrete vector classes and their mapping to x64 registers:
      | Vector                                                                                       | x64 register |
      |----------------------------------------------------------------------------------------------|--------------|
      | Byte64Vector, Short64Vector, Int64Vector, Long64Vector, Float64Vector, Double64Vector        | xmm?         |
      | Byte128Vector, Short128Vector, Int128Vector, Long128Vector, Float128Vector, Double128Vector  | xmm?         |
      | Byte256Vector, Short256Vector, Int256Vector, Long256Vector, Float256Vector, Double256Vector  | ymm?         |
      | Byte512Vector, Short512Vector, Int512Vector, Long512Vector, Float512Vector, Double512Vector  | zmm?         |

      These classes are non-public since there is no need to provide operations specific to the type and
      shape. This reduces the API surface to a sum of concerns rather than a product. As a result,
      instances of concrete Vector classes cannot be constructed directly. Instead,
      instances are obtained via factory methods defined in the base `Vector<E>` and its type-specific sub-classes.
      These methods take as input the species of the desired vector instance. The factory methods
      provide different ways to obtain vector instances, such as the vector
      instance whose elements are initialized to default values (the zero vector), or
      a vector from an array, in addition to providing the canonical support for
      converting between vectors of different types and/or shapes (e.g. casting).
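
      A sketch of the factory style (the `fromArray` factory appears in the examples
      below; the `zero` factory name is an assumption based on the description of the
      zero vector above):

          VectorSpecies<Float> species = FloatVector.SPECIES_256;
          float[] a = new float[species.length()];

          var zeros    = FloatVector.zero(species);            // all lanes initialized to 0.0f (assumed factory name)
          var fromData = FloatVector.fromArray(species, a, 0); // lanes loaded from a[0..species.length()-1]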

      Here is a simple scalar computation over elements of arrays:

          void scalarComputation(float[] a, float[] b, float[] c) {
             for (int i = 0; i < a.length; i++) {
                  c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
             }
          }

      (It is assumed that the array arguments will be of the same size.)

      An explicit way to implement the equivalent vector computation using the Vector
      API is as follows:

          static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

          void vectorComputation(float[] a, float[] b, float[] c) {
              int i = 0;
              for (; i < (a.length & ~(SPECIES.length() - 1));
                     i += SPECIES.length()) {
                  // FloatVector va, vb, vc;
                  var va = FloatVector.fromArray(SPECIES, a, i);
                  var vb = FloatVector.fromArray(SPECIES, b, i);
                  var vc = va.mul(va).
                              add(vb.mul(vb)).
                              neg();
                  vc.intoArray(c, i);
              }

              for (; i < a.length; i++) {
                  c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
              }
          }


      In this example, a species for 256-bit-wide vectors of `float`s is obtained from `FloatVector`.
      The species is stored in a static final field so the runtime compiler will treat the field's
      value as a constant and therefore be able to better optimize the vector computation.

      The vector computation features a main loop kernel iterating over the
      arrays in strides of vector length (the species length). The static `fromArray` method loads
      `float` vectors of the given species from arrays `a` and `b` at the corresponding index. Then
      the operations are performed (fluently), and finally the result is stored into
      array `c`.

      The scalar computation after the vector computation is required to process the
      *tail* of elements, the length of which is smaller than the species length. We
      shall see later how this example compiles to vector hardware instructions and
      how the Java implementation can be improved upon.


      To support control flow, relevant vector operations will optionally accept masks,
      represented by the public abstract class `VectorMask<E>`. Each element in a mask, a boolean
      value or bit, corresponds to a vector lane. When a mask is an input to an operation
      it governs whether the operation is applied to each lane: the operation is applied if the
      mask bit for the lane is set (is true), and alternative behaviour occurs if the mask bit
      is not set (is false).
      Similar to vectors, instances of `VectorMask<E>` are instances of (private) concrete sub-classes defined for
      each element type and length combination. The instance of `VectorMask<E>` used in an operation should have the same
      type and length as the instance(s) of `Vector<E>` involved in the operation. Comparison operations
      produce masks, which can then be input to other operations to selectively disable the
      operation on certain lanes and thereby emulate flow control. Masks can also be created
      using static factory methods on `VectorMask<E>`.
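
      For instance, under the API sketched in this document a comparison mask can drive
      a per-lane selection. The exact signatures of `lessThan` and `blend` below are
      assumptions based on the operation names used in this JEP; `x.blend(y, m)` is
      assumed to select the lane from `y` where the mask bit is set and from `x`
      otherwise:

          static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

          // Lane-wise maximum of the first SPECIES.length() elements of a and b,
          // expressed with a comparison mask and a blend instead of a per-element branch.
          void lanewiseMax(float[] a, float[] b, float[] r) {
              var va = FloatVector.fromArray(SPECIES, a, 0);
              var vb = FloatVector.fromArray(SPECIES, b, 0);
              VectorMask<Float> m = va.lessThan(vb); // set in lanes where a[i] < b[i]
              va.blend(vb, m).intoArray(r, 0);       // take b[i] where a[i] < b[i], else a[i]
          }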

      It is anticipated that masks will likely play an important role in the
      development of vector computations that are generic to shape. (This is based
      on the central importance of predicate registers, the equivalent of masks, in
      the ARM Scalable Vector Extensions as well as in Intel's AVX-512.)


      ### Example

      Continuing with the example presented at the beginning of the description
      section, the HotSpot compiler should generate machine code similar to the
      following:

            0.43% / │ 0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
            7.38% │ │ 0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
            8.70% │ │ 0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
            5.60% │ │ 0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
           13.16% │ │ 0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
           21.86% │ │ 0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
            7.66% │ │ 0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
           26.20% │ │ 0x0000000113d438b9: add $0x8,%ebx
            6.44% │ │ 0x0000000113d438bc: cmp %r11d,%ebx
                   \ │ 0x0000000113d438bf: jl 0x0000000113d43890

      This is actual output from a JMH micro-benchmark for the example code under
      test using a prototype of the Vector API and implementation (the
      `vectorIntrinsics` branch of Project Panama's development repository).

      The hot areas of C2 generated machine code are presented. There is a clear
      translation to vector registers and vector hardware instructions. (Note: loop
      unrolling was disabled to make the translation clearer; otherwise HotSpot should
      be able to unroll using existing C2 loop optimization techniques.) All Java
      object allocations are elided.

      It is an important goal to support more complex non-trivial vector computations
      that translate clearly into generated machine code.

      There are, however, a few issues with this particular vector computation:

      1. The loop is hardcoded to a concrete vector shape, so the computation cannot
      adapt dynamically to a maximal shape supported by the architecture (which may be
      smaller or larger than 256 bits). Therefore the code is less portable and may be
      less performant.

      2. Calculation of the loop upper bounds, although simple here, can be a common
      source of programming error.

      3. A scalar loop is required at the end, duplicating code.

      The first two issues will be addressed by this JEP. A preferred species can be
      obtained whose shape is optimal for the current architecture, the vector
      computation can then be written with a generic shape, and a method on the
      species can round down the array length, for example:

          static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

          void vectorComputation(float[] a, float[] b, float[] c,
                  VectorSpecies<Float> species) {
              int i = 0;
              int upperBound = species.loopBound(a.length);
              for (; i < upperBound; i += species.length()) {
                  //FloatVector va, vb, vc;
                  var va = FloatVector.fromArray(species, a, i);
                  var vb = FloatVector.fromArray(species, b, i);
                  var vc = va.mul(va).
                              add(vb.mul(vb)).
                              neg();
                  vc.intoArray(c, i);
              }

              for (; i < a.length; i++) {
                  c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
              }
          }

          vectorComputation(a, b, c, SPECIES);

      The last issue will not be fully addressed by this JEP and will be the subject
      of follow-on work. One solution could be easily expressed in the API by using
      a mask as an additional loop variable, for example:

          void vectorComputation(float[] a, float[] b, float[] c,
                  VectorSpecies<Float> species) {
              VectorMask<Float> m = species.maskFromBounds(0, a.length);
              for (int i = 0;
                   i < a.length;
                   i += species.length(), m = species.maskFromBounds(i, a.length)) {
                  //FloatVector va, vb, vc;
                  var va = FloatVector.fromArray(species, a, i, m);
                  var vb = FloatVector.fromArray(species, b, i, m);
                  var vc = va.mul(va).
                              add(vb.mul(vb)).
                              neg();
                  vc.intoArray(c, i, m);
              }
          }

      The mask will ensure that loads and stores will not result in out of bounds
      exceptions (in this case the calculation is such that masks do not need to be
      provided to the other vector operations). It is anticipated that such masked
      loops will work well for a range of architectures, including x64 and ARM, but
      will require additional runtime compiler support to generate maximally efficient
      code. Such work on masked loops, though important, is beyond the scope of this
      JEP.

      ### HotSpot C2 compiler implementation details

      The Vector API has two implementations in order to adhere to the project goals.
      The first implements operations in Java and is therefore functional but not optimal.
      The second makes those operations intrinsic to the HotSpot C2 compiler, with
      special treatment for Vector API types. This allows for proper translation to
      x64 registers and instructions in the cases where architecture support and an
      implementation of the translation exist.

      The intrinsification process for the Vector API will work by translating
      Vector API method calls to C2 IR Nodes that represent appropriate intended
      semantics. For example, for `Float256Vector.add`, the C2 compiler
      will replace the call with an `AddVF` node plus a `VectorBox` node. The `AddVF`
      represents addition of two `float` vectors while the `VectorBox` represents the
      boxing portion to create a valid object. (Thus `add` on two `Vector` objects
      will produce a resulting `Vector` object.) This way, object creation (if any)
      is submerged under the vector operation, so in cases where the object does not
      need to exist, it can be eliminated.
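
      Conceptually, for the `Float256Vector.add` example above, the translation can be
      pictured as follows (a descriptive sketch of the IR shape described in this
      section, not actual compiler output):

          // Java source:
          //     FloatVector vc = va.add(vb);   // va, vb backed by 256-bit float vectors
          //
          // C2 IR after intrinsification (conceptually):
          //
          //     va --+
          //          +-- AddVF --> VectorBox --> vc
          //     vb --+
          //
          // AddVF performs the lane-wise float addition in a vector register;
          // VectorBox materializes a Float256Vector object only if vc actually
          // escapes, otherwise the allocation is eliminated.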

      The IR nodes generated by intrinsification will overlap with the IR nodes used
      by the auto-vectorizer. However, because the Vector API will support a much larger set
      of operations, additional IR nodes will be added as needed. In order to keep
      the number of newly added nodes to a minimum, new nodes will not encode the element
      type in the node name. For example, the `VectorBlend` node supports blending and
      masking operations; there is no `VectorBlendI` node for `int` vectors.
      Instead, the extra type information is encoded using the existing type system
      (`TypeVect`), which encodes the element type along with the shape.

      It is intended that for all of the vector operations defined by the API, there
      will be a translation implemented that will allow use of x64 instructions on some
      x64 architectures. For example, `Byte256Vector.blend` will
      translate to `vpblendvb` (AVX2) whereas `Byte512Vector.blend` will
      translate to `vpblendmb` (AVX-512). The translation may be non-optimal. If
      `Byte512Vector.blend` is used on a system that only supports AVX2,
      no translation will occur and instead the default Java implementation will be
      used. That said, the type-specific vector classes provide the `SPECIES_PREFERRED`
      field corresponding to the appropriate vector size to use. Behind the
      scenes, this field is set by calling into
      `Matcher::vector_width_in_bytes` so that this value is dynamically computed
      depending on the system. This species can be used for generically sized vector
      computations so no concrete species needs to be declared.
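
      For example, a computation written against the preferred species adapts to
      whatever the platform supports (a sketch; `ByteVector.SPECIES_PREFERRED` is
      assumed by analogy with the `FloatVector.SPECIES_PREFERRED` constant used
      earlier, and the sizes in the comments depend on the CPU features detected at
      runtime):

          static final VectorSpecies<Byte> BYTE_SPECIES = ByteVector.SPECIES_PREFERRED;

          // On an AVX-512 system this is a 512-bit species (64 byte lanes);
          // on an AVX2-only system it is a 256-bit species (32 byte lanes).
          int lanes = BYTE_SPECIES.length();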

      The set of operations on `Vector`, `VectorSpecies` and `VectorMask` will be selected for
      their applicability for C2 intrinsification on x64 architectures. Additional
      non-intrinsified operations may be placed off to the side in helper classes.
      In future work, these divisions may be adjusted in order to provide a more
      fully platform-agnostic API.

      To avoid an explosion of intrinsics added to C2, a set of intrinsics will be
      defined that correspond to operation kinds, such as binary, unary, comparison,
      and so on, where constant arguments are passed describing operation specifics.
      Approximately ten new intrinsics will be needed to support intrinsification of
      all parts of the API.

      The C2 compiler will have special knowledge of the `Vector`, `VectorSpecies` and
      `VectorMask` types and all their sub-types. This will enable C2 to map instances of
      `Vector` to vector registers and aggressively elide allocations when such
      instances do not escape. C2 will also have knowledge of the treatment of vector
      registers and vector objects at safepoints so that it can safely save them and
      also safely reconstruct Vector objects. Special attention will be taken to ensure
      that, by default, object semantics (such as identity) are preserved when an instance
      escapes or needs to be materialized as a reference to a `Vector` object.

      `Vector` instances are value-based, morally values where identity-sensitive
      operations should be avoided. This potentially limits the set of applicable
      optimizations, specifically due to the limitations of escape analysis. A flag
      will be provided to enable `Vector` instances to have no guaranteed identity and
      thereby support more aggressive optimizations such as lazy materialization at a
      safepoint. When value types are fully supported by the Java language and
      runtime (see [Project Valhalla][Valhalla]) then concrete `Vector` classes can be
      made value types and it is anticipated such a flag and many optimizations will
      no longer be required.

      Mask support will require careful attention on x64 architectures since there are
      two kinds of representations, a vector register representation or an `opmask`
      register representation (for AVX-512), and different instructions will take one
      or the other. In the initial implementation, it is expected that all masks will
      be represented as vector registers even for AVX-512. This means that native
      masking via `opmask` (or `k`) registers will not be supported in the first
      implementation. Platforms like AVX-512 and ARM SVE motivate our treatment of
      `VectorMask` as a special type rather than as an ordinary combination of `Vector` and
      `boolean` types.

      [Valhalla]:http://openjdk.java.net/projects/valhalla/

      ### Future Work

      The Vector API will benefit significantly from value types when ready (see
      [Project Valhalla](http://openjdk.java.net/projects/valhalla)). Instances of a
      `Vector<E>` can be values, whose concrete classes are value types. This
      will make it easier to optimize and express vector computations. Sub-types of
      `Vector<E>` for specific types, such as `IntVector`, will no longer be
      required with generic specialization over values and type-specific method
      declaration. A shift to value types is thought to be backward compatible,
      perhaps after recompilation of Vector API code. Some abstract classes may need
      conversion to interfaces, if they are supers of value types.

      A future version of the Vector API may make use of enhanced generics,
      as noted above.

      It is expected that the API will incubate over multiple releases of the JDK and will adapt as
      dependent features such as value types become available in a future JDK release and
      newer CPU architectures become more established in the industry.

      API points for loop control, loop boundary processing, and active set
      maintenance are likely to be added or refined in a future version of this API.
      Additional vector shapes with intrinsic masks or lengths, or synthetic tandem
      vector types (vector pairs) may be introduced if they are found to help with
      loop management. Methods for alignment control may also be introduced, if they
      show benefits in portability or performance.

      Scatter and gather operations which can traverse managed heap pointers may be
      introduced in the future, if a portable and type-safe way can be found to
      express them (such as `VarHandle`s). This would allow workloads to be accessed
      directly in Java objects, instead of being buffered through Java arrays or byte
      buffers.

      Additional vector sizes and shapes may be supported in a future version of this
      API, in a follow-on JEP or perhaps during incubation. In principle the API could
      express additional vector shape properties besides bit-size, such as whether a
      vector is dense or not, whether it possesses an intrinsic mask, whether and how
      it may be dynamically sized, whether the size is a power of two, etc.

      A future version of this API may introduce additional, non-primitive lane types
      such as short floats (useful for machine learning) or very long integers (useful
      for cryptography), along with relevant specialized operations. Such types tend
      to be hardware-specific, and so a challenge of specifying such API points is
      either making them portable, or else properly scoping them to machine-specific
      instances of the JDK.

      Alternatives
      ------------

      HotSpot's auto-vectorization is an alternative approach but it would require
      significant enhancement and would likely still be fragile and limited compared
      to using the Vector API, since auto-vectorization with complex control flow is
      very hard to perform.

      In general, and even after decades of research (especially for FORTRAN and C
      array loops), it seems that auto-vectorization of scalar code is not a reliable
      tactic for optimizing ad hoc user-written loops, unless the user pays unusually
      careful attention to unwritten contracts about exactly which loops a compiler is
      prepared to auto-vectorize. It's too easy to write a loop that fails to
      auto-vectorize, for a reason that only the optimizer can detect, and not the
      human reader. Years of work on auto-vectorization (even in HotSpot) have left
      us with lots of optimization machinery that works only on special occasions. We
      want to enjoy the use of this machinery more often!


      Testing
      -------

      Combinatorial unit tests will be developed to ensure coverage for all
      operations, for all supported types and shapes, over various data sets. The
      tests will be implemented with TestNG and will be exercisable via `jtreg`.
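
      A minimal sketch of the intended combinatorial style, assuming TestNG and the
      API used elsewhere in this document (the class, test, and data-provider names
      are illustrative):

          import jdk.incubator.vector.FloatVector;
          import org.testng.Assert;
          import org.testng.annotations.DataProvider;
          import org.testng.annotations.Test;

          public class FloatVectorAddTest {

              @DataProvider
              public Object[][] sizes() {
                  // Exercise array lengths around the species length, including a tail.
                  int l = FloatVector.SPECIES_256.length();
                  return new Object[][] { {l}, {2 * l}, {2 * l + 3} };
              }

              @Test(dataProvider = "sizes")
              public void addMatchesScalar(int size) {
                  var species = FloatVector.SPECIES_256;
                  float[] a = new float[size], b = new float[size], r = new float[size];
                  for (int i = 0; i < size; i++) { a[i] = i; b[i] = 2 * i; }

                  for (int i = 0; i < species.loopBound(size); i += species.length()) {
                      FloatVector.fromArray(species, a, i)
                                 .add(FloatVector.fromArray(species, b, i))
                                 .intoArray(r, i);
                  }
                  for (int i = 0; i < species.loopBound(size); i++) {
                      Assert.assertEquals(r[i], a[i] + b[i], 0.0f); // vector result matches scalar result
                  }
              }
          }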

      Performance tests will be developed to ensure performance goals are met and
      vector computations map efficiently to vector hardware instructions. This will
      likely consist of JMH micro-benchmarks, but more realistic examples of useful
      algorithms will also be required.
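
      A sketch of the JMH shape such a micro-benchmark might take, assuming the
      `vectorComputation` and `scalarComputation` methods shown earlier are available
      to the benchmark class (names and problem sizes are illustrative):

          import java.util.concurrent.TimeUnit;
          import org.openjdk.jmh.annotations.*;

          @BenchmarkMode(Mode.AverageTime)
          @OutputTimeUnit(TimeUnit.NANOSECONDS)
          @State(Scope.Thread)
          public class VectorComputationBenchmark {

              @Param({"1024", "65536"})
              public int size;

              public float[] a, b, c;

              @Setup
              public void setup() {
                  a = new float[size]; b = new float[size]; c = new float[size];
                  for (int i = 0; i < size; i++) { a[i] = i; b[i] = size - i; }
              }

              @Benchmark
              public float[] scalar() { scalarComputation(a, b, c); return c; }

              @Benchmark
              public float[] vector() { vectorComputation(a, b, c); return c; }
          }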

      As a backup to performance tests, we will create white-box tests to
      force the JIT to report to us that vector API source code did, in
      fact, trigger vectorization.


      Risks and Assumptions
      ---------------------

      There is a risk that the API will be biased to the SIMD functionality supported
      on x64 architectures. This applies mainly to the explicitly fixed set of
      supported shapes, which biases against coding algorithms in a shape-generic
      fashion. We consider the majority of other operations of the Vector API to bias
      toward portable algorithms. To mitigate that risk, other architectures will be
      taken into account, specifically the ARM Scalable Vector Extension architecture,
      whose programming model adjusts dynamically to the single fixed shape
      supported by the hardware. We welcome and encourage OpenJDK contributors working
      on the ARM specific areas of HotSpot to participate in this effort.

      The Vector API uses box types (like `Integer`) as proxies for primitive types
      (like `int`). This decision is forced by the current limitations of Java
      generics (which are hostile to primitive types). When Project Valhalla
      eventually introduces more capable generics, the current decision will seem
      awkward, and may need changing. We assume that such changes will be possible
      without excessive backwards incompatibility.
