Updated JEP 338: Vector API to reflect API changes


      Summary
      -------

      Provide an initial iteration of an [incubator module], `jdk.incubator.vector`, to express
      vector computations that reliably compile at runtime to optimal vector hardware instructions on
      supported CPU architectures and thus achieve superior performance to equivalent scalar
      computations.

      [incubator module]: http://openjdk.java.net/jeps/11


      Goals
      -----

      - *Clear and concise API:*
      The API shall be capable of clearly and concisely expressing a wide range of
      vector computations consisting of a sequence of vector operations often composed
      within loops and possibly with control flow. It should be possible to express a
      computation that is generic to vector size (or the number of lanes per vector)
      thus enabling such computations to be portable across hardware supporting
      different vector sizes.

      - *Reliable runtime compilation and performance on x64 architectures:*
      The Java runtime, specifically the HotSpot C2 compiler, shall compile, on
      capable x64 architectures, a sequence of vector operations to a corresponding
      sequence of vector hardware instructions, such as those supported by
      [Streaming SIMD Extensions][SSE] (SSE) and [Advanced Vector Extensions][AVX]
      (AVX), thereby generating efficient and performant code.
      The programmer shall have confidence that the vector operations they express
      will reliably map closely to associated hardware vector instructions.

      [SSE]:https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
      [AVX]:https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

      - *Platform agnostic:*
      The API shall be architecture agnostic, enabling runtime implementations
      on multiple CPU architectures that support vector hardware instructions.
      As is usual in Java APIs, where platform optimization and portability
      conflict, the bias will be to making the Vector API portable, even if some
      platform-specific idioms cannot be directly expressed in portable code.
      The previous goal of x64 performance is representative of appropriate
      performance goals on all platforms where Java is supported.
      The [ARM Scalable Vector Extension][SVE] (SVE) is of special interest in this
      regard to ensure the API can support this architecture (even though as of
      writing there are no known production hardware implementations).

      [SVE]:https://arxiv.org/pdf/1803.06185.pdf

      - *Graceful degradation:*
      If a vector computation cannot be fully expressed at runtime as a sequence of
      hardware vector instructions, either because an x64 architecture does not
      support some of the required instructions or because another CPU architecture is
      not supported, then the Vector API implementation shall degrade gracefully and
      still function. This may include issuing warnings to the developer if a vector
      computation cannot be sufficiently compiled to vector hardware instructions.
      On platforms without vectors, graceful degradation shall yield code
      competitive with manually-unrolled loops, where the unroll factor is
      the number of lanes in the selected vector.


      Non-Goals
      ---------

      - It is not a goal to enhance the auto-vectorization support in HotSpot.

      - It is not a goal for HotSpot to support vector hardware instructions on CPU
      architectures other than x64. Such support is left for later JEPs.
      However, it is important to state, as expressed
      in the goals, that the API must not rule out such implementations. Further,
      work performed may naturally leverage and extend existing abstractions in
      HotSpot for auto-vectorization support, making such a task easier.

      - It is not a goal to support the C1 compiler in this or future iterations. We
      expect the Graal compiler to be supported in future work.

      - It is not a goal to support strict floating point calculations as defined by
      the Java `strictfp` keyword. The results of floating point operations performed
      on floating point scalars may differ from equivalent floating point operations
      performed on vectors of floating point scalars. However, this non-goal does not
      rule out options to express or control the desired precision or reproducibility
      of floating point vector computations.


      Motivation
      ----------

      Vector computations consist of a sequence of operations on vectors. A vector
      comprises a (usually) fixed sequence of scalar values, where each scalar value
      corresponds to a hardware-defined vector lane. A binary operation applied
      to two vectors with the same number of lanes would, for each lane, apply the
      equivalent scalar operation on the corresponding two scalar values from each
      vector. This is commonly referred to as
      [Single Instruction Multiple Data][SIMD] (SIMD).

      [SIMD]:https://en.wikipedia.org/wiki/SIMD

      Vector operations express a degree of parallelism that enables more work to be
      performed in a single CPU cycle and thus can result in significant performance
      gains. For example, given two vectors each covering a sequence of eight
      integers (eight lanes), then the two vectors can be added together using a
      single hardware instruction. The vector addition hardware instruction operates
      on sixteen integers, performing eight integer additions, in the time it would
      ordinarily take to operate on two integers, performing one integer addition.
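
      For illustration, using the API described later in this document (the `IntVector`
      class and its `fromArray`, `add`, and `intoArray` methods), such an eight-lane
      addition could be written as follows. This is only a sketch; the species constant
      `SPECIES_256` is an assumption chosen so that eight 32-bit lanes fill the vector.

          static final VectorSpecies<Integer> I256 = IntVector.SPECIES_256; // eight 32-bit int lanes

          // Adds the first eight elements of x and y lane-wise and stores the result in z.
          // A single vector addition replaces eight scalar additions.
          void addEightLanes(int[] x, int[] y, int[] z) {
              var vx = IntVector.fromArray(I256, x, 0); // load lanes x[0..7]
              var vy = IntVector.fromArray(I256, y, 0); // load lanes y[0..7]
              vx.add(vy).intoArray(z, 0);               // one lane-wise add, then store
          }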

      HotSpot supports [auto-vectorization] where scalar operations are transformed into
      superword operations, which are then mapped to vector hardware instructions.
      The set of transformable scalar operations is limited and fragile with respect
      to changes in the code shape. Furthermore, only a subset of available vector
      hardware instructions might be utilized, limiting the performance of generated code.

      [auto-vectorization]:http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf

      A developer wishing to write scalar operations that are reliably transformed
      into superword operations needs to understand HotSpot's auto-vectorization
      support and its limitations to achieve reliable and sustainable performance.

      In some cases it may not be possible for the developer to write scalar
      operations that are transformable. For example, HotSpot does not transform the
      simple scalar operations for calculating the hash code of an array (see the
      `Arrays.hashCode` method implementations in the JDK source code), nor can it
      auto-vectorize code to lexicographically compare two arrays (which is why an
      intrinsic was added to perform lexicographical comparison, see
      [JDK-8033148][JDK-8033148]).

      [JDK-8033148]:https://bugs.openjdk.java.net/browse/JDK-8033148

      The Vector API aims to address these issues by providing a mechanism to write
      complex vector algorithms in Java, using pre-existing support in HotSpot
      for vectorization, but with a user model which makes vectorization far more
      predictable and robust. Hand-coded vector loops can express high-performance
      algorithms (such as vectorized `hashCode` or specialized array comparison)
      which an auto-vectorizer may never optimize.
      There are numerous domains where this explicitly vectorizing
      API may be applicable such as machine learning, linear algebra, cryptography,
      finance, and usages within the JDK itself.


      Description
      -----------

      A vector will be represented by the abstract class `Vector<E>`.
      The type variable `E` corresponds to the boxed type of scalar primitive integral
      or floating point element types covered by the vector. A vector also has a shape,
      which defines the size, in bits, of the vector. The shape of the vector will govern
      how an instance of `Vector<E>` is mapped to a vector hardware register when vector
      computations are compiled by the HotSpot C2 compiler (see later for a mapping from
      instances to x64 vector registers). The length of a vector (number of lanes or elements)
      will be the vector size divided by the element size.

      The set of element types (`E`) supported will be `Byte`, `Short`, `Integer`, `Long`,
      `Float` and `Double` corresponding to the scalar primitive types `byte`,
      `short`, `int`, `long`, `float` and `double`, respectively.

      The set of shapes supported will correspond to vector sizes of 64, 128, 256, and 512 bits.
      A shape corresponding to a size of 512 bits can pack `byte`s into 64 lanes or pack
      `int`s into 16 lanes, and a vector of such a shape can operate on 64 `byte`s at
      a time, or 16 `int`s at a time.

      (_Note:_ We believe that these simple shapes are generic enough to be
      useful on all platforms supporting the Vector API. However, as we
      experiment during the incubation of this JEP with future platforms, we may further
      modify the design of the shape parameter. Such work is not in
      the early scope of this JEP, but these possibilities partly inform the
      present role of shapes in the Vector API. See the "Future Work" section.)

      The combination of element type and shape determines the vector's species, represented by `VectorSpecies<E>`.
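
      For example (a sketch; the species constants below follow the naming used in the
      examples later in this document, and the 512-bit `int` constant is an assumption
      by analogy):

          VectorSpecies<Float>   fs = FloatVector.SPECIES_256; // float elements, 256-bit shape
          VectorSpecies<Integer> is = IntVector.SPECIES_512;   // int elements, 512-bit shape

          // number of lanes = vector size in bits / element size in bits
          int floatLanes = fs.length(); // 256 / 32 = 8 lanes
          int intLanes   = is.length(); // 512 / 32 = 16 lanes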

      An instance of `Vector<E>` is immutable and is a value-based type that
      retains, by default, object identity invariants (see later for relaxation of
      these invariants).

      `Vector<E>` declares a set of methods for common vector operations supported
      by all element types. Such operations can be classified into groups of unary
      (e.g. negation), binary (e.g. addition), and comparison (e.g. lessThan), with
      corresponding familiar scalar operational equivalents. Further operations are
      more specific to vectors, such as cross-lane operations (e.g. permuting elements),
      converting to a vector of a different type and/or shape (e.g. casting), or
      storing the vector elements into a data container (e.g. a `byte[]` array).

      To support operations specific to an element type there are six
      abstract sub-classes of `Vector<E>`, one for each supported element type,
      `ByteVector`, `ShortVector`, `IntVector`, `LongVector`,
      `FloatVector`, and `DoubleVector`. There are operations that are specific
      to the integral sub-types, such as bitwise operations (e.g. logical xor),
      and operations specific to the floating point types, such as mathematical
      operations (e.g. transcendental functions like cosine). Other operations are
      bound to the element type since the method signature refers to the element type
      (or the equivalent array type), such as reduction operations (e.g. sum all
      elements to a scalar value), or storing the vector elements to an array.
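
      As a rough sketch of such type-bound operations, the calls below show the intended
      shape of the API; the method names `xor` and `addLanes` are illustrative
      placeholders rather than confirmed API points:

          int[]   ints   = new int[IntVector.SPECIES_256.length()];
          float[] floats = new float[FloatVector.SPECIES_256.length()];

          var vi = IntVector.fromArray(IntVector.SPECIES_256, ints, 0);     // integral vector
          var vf = FloatVector.fromArray(FloatVector.SPECIES_256, floats, 0);

          IntVector vx = vi.xor(vi);   // bitwise operation, defined only on integral sub-types (placeholder name)
          int sum = vi.addLanes();     // reduction of all lanes to an int scalar (placeholder name)
          vf.intoArray(floats, 0);     // store bound to the element's array type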

      These classes are further extended by concrete sub-classes defined for the different shapes (sizes) of vectors.

      The following table presents the concrete vector classes and their mapping to x64 registers:
      | Vector                                                                                       | x64 register |
      |----------------------------------------------------------------------------------------------|--------------|
      | Byte64Vector, Short64Vector, Int64Vector, Long64Vector, Float64Vector, Double64Vector        | xmm?         |
      | Byte128Vector, Short128Vector, Int128Vector, Long128Vector, Float128Vector, Double128Vector  | xmm?         |
      | Byte256Vector, Short256Vector, Int256Vector, Long256Vector, Float256Vector, Double256Vector  | ymm?         |
      | Byte512Vector, Short512Vector, Int512Vector, Long512Vector, Float512Vector, Double512Vector  | zmm?         |

      These classes are non-public since there is no need to provide operations specific to the type and
      shape. This reduces the API surface to a sum of concerns rather than a product. As a result,
      instances of concrete Vector classes cannot be constructed directly. Instead,
      instances are obtained via factory methods defined in the base `Vector<E>` and its type-specific sub-classes.
      These methods take as input the species of the desired vector instance. The factory methods
      provide different ways to obtain vector instances, such as the vector
      instance whose elements are initialized to default values (the zero vector), or
      a vector from an array, in addition to providing the canonical support for
      converting between vectors of different types and/or shapes (e.g. casting).
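
      A sketch of the factory style (the `fromArray` factory appears in the examples
      below; the `zero` factory name is an assumption based on the description of the
      zero vector above):

          VectorSpecies<Float> species = FloatVector.SPECIES_256;
          float[] a = new float[species.length()];

          var zeros    = FloatVector.zero(species);            // all lanes initialized to 0.0f (assumed factory name)
          var fromData = FloatVector.fromArray(species, a, 0); // lanes loaded from a[0..species.length()-1]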

      Here is a simple scalar computation over elements of arrays:

          void scalarComputation(float[] a, float[] b, float[] c) {
             for (int i = 0; i < a.length; i++) {
                  c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
             }
          }

      (It is assumed that the array arguments will be of the same size.)

      An explicit way to implement the equivalent vector computation using the Vector
      API is as follows:

          static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

          void vectorComputation(float[] a, float[] b, float[] c) {
              int i = 0;
              for (; i < (a.length & ~(SPECIES.length() - 1));
                     i += SPECIES.length()) {
                  // FloatVector va, vb, vc;
                  var va = FloatVector.fromArray(SPECIES, a, i);
                  var vb = FloatVector.fromArray(SPECIES, b, i);
                  var vc = va.mul(va).
                              add(vb.mul(vb)).
                              neg();
                  vc.intoArray(c, i);
              }

              for (; i < a.length; i++) {
                  c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
              }
          }


      In this example, a species for 256-bit-wide vectors of `float`s is obtained from `FloatVector`.
      The species is stored in a static final field so the runtime compiler will treat the field's
      value as a constant and therefore be able to better optimize the vector computation.

      The vector computation features a main loop kernel iterating over the
      arrays in strides of vector length (the species length). The static `fromArray` method loads
      `float` vectors of the given species from arrays `a` and `b` at the corresponding index. Then
      the operations are performed (fluently), and finally the result is stored into
      array `c`.

      The scalar computation after the vector computation is required to process the
      *tail* of elements, the length of which is smaller than the species length. We
      shall see later how this example compiles to vector hardware instructions and
      how the Java implementation can be improved upon.


      To support control flow, relevant vector operations will optionally accept masks,
      represented by the public abstract class `VectorMask<E>`. Each element in a mask, a boolean
      value or bit, corresponds to a vector lane. When a mask is an input to an operation
      it governs whether the operation is applied to each lane: the operation is applied if the
      mask bit for the lane is set (is true), and alternative behaviour occurs if the mask bit
      is not set (is false).
      Similar to vectors, instances of `VectorMask<E>` are instances of (private) concrete sub-classes defined for
      each element type and length combination. The instance of `VectorMask<E>` used in an operation should have the same
      type and length as the instance(s) of `Vector<E>` involved in the operation. Comparison operations
      produce masks, which can then be input to other operations to selectively disable the
      operation on certain lanes and thereby emulate flow control. Masks can also be created
      using static factory methods on `VectorMask<E>`.
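
      For instance, under the API sketched in this document a comparison mask can drive
      a per-lane selection. The exact signatures of `lessThan` and `blend` below are
      assumptions based on the operation names used in this JEP; `x.blend(y, m)` is
      assumed to select the lane from `y` where the mask bit is set and from `x`
      otherwise:

          static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_256;

          // Lane-wise maximum of the first SPECIES.length() elements of a and b,
          // expressed with a comparison mask and a blend instead of a per-element branch.
          void lanewiseMax(float[] a, float[] b, float[] r) {
              var va = FloatVector.fromArray(SPECIES, a, 0);
              var vb = FloatVector.fromArray(SPECIES, b, 0);
              VectorMask<Float> m = va.lessThan(vb); // set in lanes where a[i] < b[i]
              va.blend(vb, m).intoArray(r, 0);       // take b[i] where a[i] < b[i], else a[i]
          }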

      It is anticipated that masks will likely play an important role in the
      development of vector computations that are generic to shape. (This is based
      on the central importance of predicate registers, the equivalent of masks, in
      the ARM Scalable Vector Extensions as well as in Intel's AVX-512.)


      ### Example

      Continuing with the example presented at the beginning of the description
      section, the HotSpot compiler should generate machine code similar to the
      following:

            0.43% / │ 0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
            7.38% │ │ 0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
            8.70% │ │ 0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
            5.60% │ │ 0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
           13.16% │ │ 0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
           21.86% │ │ 0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
            7.66% │ │ 0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
           26.20% │ │ 0x0000000113d438b9: add $0x8,%ebx
            6.44% │ │ 0x0000000113d438bc: cmp %r11d,%ebx
                   \ │ 0x0000000113d438bf: jl 0x0000000113d43890

      This is actual output from a JMH micro-benchmark for the example code under
      test using a prototype of the Vector API and implementation (the
      `vectorIntrinsics` branch of Project Panama's development repository).

      The hot areas of C2 generated machine code are presented. There is a clear
      translation to vector registers and vector hardware instructions. (Note: loop
      unrolling was disabled to make the translation clearer; otherwise HotSpot should
      be able to unroll using existing C2 loop optimization techniques.) All Java
      object allocations are elided.

      It is an important goal to support more complex non-trivial vector computations
      that translate clearly into generated machine code.

      There are, however, a few issues with this particular vector computation:

      1. The loop is hardcoded to a concrete vector shape, so the computation cannot
      adapt dynamically to a maximal shape supported by the architecture (which may be
      smaller or larger than 256 bits). Therefore the code is less portable and may be
      less performant.

      2. Calculation of the loop upper bounds, although simple here, can be a common
      source of programming error.

      3. A scalar loop is required at the end, duplicating code.

      The first two issues will be addressed by this JEP. A preferred species can be
      obtained whose shape is optimal for the current architecture, the vector
      computation can then be written with a generic shape, and a method on the
      species can round down the array length, for example:

          static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

          void vectorComputation(float[] a, float[] b, float[] c,
                  VectorSpecies<Float> species) {
              int i = 0;
              int upperBound = species.loopBound(a.length);
              for (; i < upperBound; i += species.length()) {
                  //FloatVector va, vb, vc;
                  var va = FloatVector.fromArray(species, a, i);
                  var vb = FloatVector.fromArray(species, b, i);
                  var vc = va.mul(va).
                              add(vb.mul(vb)).
                              neg();
                  vc.intoArray(c, i);
              }

              for (; i < a.length; i++) {
                  c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
              }
          }

          vectorComputation(a, b, c, SPECIES);

      The last issue will not be fully addressed by this JEP and will be the subject
      of follow-on work. One solution could be easily expressed in the API by using
      a mask as an additional loop variable, for example:

          void vectorComputation(float[] a, float[] b, float[] c,
                  VectorSpecies<Float> species) {
              VectorMask<Float> m = species.maskFromBounds(0, a.length);
              for (int i = 0;
                   i < a.length;
                   i += species.length(), m = species.maskFromBounds(i, a.length)) {
                  //FloatVector va, vb, vc;
                  var va = FloatVector.fromArray(species, a, i, m);
                  var vb = FloatVector.fromArray(species, b, i, m);
                  var vc = va.mul(va).
                              add(vb.mul(vb)).
                              neg();
                  vc.intoArray(c, i, m);
              }
          }

      The mask will ensure that loads and stores will not result in out of bounds
      exceptions (in this case the calculation is such that masks do not need to be
      provided to the other vector operations). It is anticipated that such masked
      loops will work well for a range of architectures, including x64 and ARM, but
      will require additional runtime compiler support to generate maximally efficient
      code. Such work on masked loops, though important, is beyond the scope of this
      JEP.

      ### HotSpot C2 compiler implementation details

      The Vector API has two implementations in order to adhere to the project goals.
      The first implements operations in Java and is therefore functional but not optimal.
      The second makes those operations intrinsic to the HotSpot C2 compiler, with
      special treatment for Vector API types. This allows for proper translation to
      x64 registers and instructions in the cases where architecture support and an
      implementation of the translation exist.

      The intrinsification process for the Vector API will work by translating
      Vector API method calls to C2 IR Nodes that represent appropriate intended
      semantics. For example, for `Float256Vector.add`, the C2 compiler
      will replace the call with an `AddVF` node plus a `VectorBox` node. The `AddVF`
      represents addition of two `float` vectors while the `VectorBox` represents the
      boxing portion to create a valid object. (Thus `add` on two `Vector` objects
      will produce a resulting `Vector` object.) This way, object creation (if any)
      is submerged under the vector operation, so in cases where the object does not
      need to exist, it can be eliminated.
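
      Conceptually, for the `Float256Vector.add` example above, the translation can be
      pictured as follows (a descriptive sketch of the IR shape described in this
      section, not actual compiler output):

          // Java source:
          //     FloatVector vc = va.add(vb);   // va, vb backed by 256-bit float vectors
          //
          // C2 IR after intrinsification (conceptually):
          //
          //     va --+
          //          +-- AddVF --> VectorBox --> vc
          //     vb --+
          //
          // AddVF performs the lane-wise float addition in a vector register;
          // VectorBox materializes a Float256Vector object only if vc actually
          // escapes, otherwise the allocation is eliminated.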

      The IR nodes generated by intrinsification will overlap with the IR nodes used
      by the auto-vectorizer. However, because the Vector API will support a much larger set
      of operations, additional IR nodes will be added as needed. In order to keep
      the number of newly added nodes to a minimum, new nodes will not encode the element
      type in the node name. For example, the `VectorBlend` node supports blending and
      masking operations; there is no `VectorBlendI` node for `int` vectors.
      Instead, the extra type information is encoded using the existing type system
      (`TypeVect`), which encodes the element type along with the shape.

      It is intended that for all of the vector operations defined by the API, there
      will be a translation implemented that will allow use of x64 instructions on some
      x64 architectures. For example, `Byte256Vector.blend` will
      translate to `vpblendvb` (AVX2) whereas `Byte512Vector.blend` will
      translate to `vpblendmb` (AVX-512). The translation may be non-optimal. If
      `Byte512Vector.blend` is used on a system that only supports AVX2,
      no translation will occur and instead the default Java implementation will be
      used. That said, the type-specific vector classes provide the `SPECIES_PREFERRED`
      field corresponding to the appropriate vector size to use. Behind the
      scenes, this field is set by calling into
      `Matcher::vector_width_in_bytes` so that this value is dynamically computed
      depending on the system. This species can be used for generically sized vector
      computations so no concrete species needs to be declared.
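
      For example, a computation written against the preferred species adapts to
      whatever the platform supports (a sketch; `ByteVector.SPECIES_PREFERRED` is
      assumed by analogy with the `FloatVector.SPECIES_PREFERRED` constant used
      earlier, and the sizes in the comments depend on the CPU features detected at
      runtime):

          static final VectorSpecies<Byte> BYTE_SPECIES = ByteVector.SPECIES_PREFERRED;

          // On an AVX-512 system this is a 512-bit species (64 byte lanes);
          // on an AVX2-only system it is a 256-bit species (32 byte lanes).
          int lanes = BYTE_SPECIES.length();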

      The set of operations on `Vector`, `VectorSpecies` and `VectorMask` will be selected for
      their applicability for C2 intrinsification on x64 architectures. Additional
      non-intrinsified operations may be placed off to the side in helper classes.
      In future work, these divisions may be adjusted in order to provide a more
      fully platform-agnostic API.

      To avoid an explosion of intrinsics added to C2, a set of intrinsics will be
      defined that correspond to operation kinds, such as binary, unary, comparison,
      and so on, where constant arguments are passed describing operation specifics.
      Approximately ten new intrinsics will be needed to support intrinsification of
      all parts of the API.

      The C2 compiler will have special knowledge of the `Vector`, `VectorSpecies` and
      `VectorMask` types and all their sub-types. This will enable C2 to map instances of
      `Vector` to vector registers and aggressively elide allocations when such
      instances do not escape. C2 will also have knowledge of the treatment of vector
      registers and vector objects at safepoints so that it can safely save them and
      also safely reconstruct Vector objects. Special attention will be taken to ensure
      that, by default, object semantics (such as identity) are preserved when an instance
      escapes or needs to be materialized as a reference to a `Vector` object.

      `Vector` instances are value-based, morally values where identity-sensitive
      operations should be avoided. This potentially limits the set of applicable
      optimizations, specifically due to the limitations of escape analysis. A flag
      will be provided to enable `Vector` instances to have no guaranteed identity and
      thereby support more aggressive optimizations such as lazy materialization at a
      safepoint. When value types are fully supported by the Java language and
      runtime (see [Project Valhalla][Valhalla]) then concrete `Vector` classes can be
      made value types and it is anticipated such a flag and many optimizations will
      no longer be required.

      Mask support will require careful attention on x64 architectures since there are
      two kinds of representations, a vector register representation or an `opmask`
      register representation (for AVX-512), and different instructions will take one
      or the other. In the initial implementation, it is expected that all masks will
      be represented as vector registers even for AVX-512. This means that native
      masking via `opmask` (or `k`) registers will not be supported in the first
      implementation. Platforms like AVX-512 and ARM SVE motivate our treatment of
      `VectorMask` as a special type rather than as an ordinary combination of `Vector` and
      `boolean` types.

      [Valhalla]:http://openjdk.java.net/projects/valhalla/

      ### Future Work

      The Vector API will benefit significantly from value types when ready (see
      [Project Valhalla](http://openjdk.java.net/projects/valhalla)). Instances of a
      `Vector<E>` can be values, whose concrete classes are value types. This
      will make it easier to optimize and express vector computations. Sub-types of
      `Vector<E>` for specific types, such as `IntVector`, will no longer be
      required with generic specialization over values and type-specific method
      declaration. A shift to value types is thought to be backward compatible,
      perhaps after recompilation of Vector API code. Some abstract classes may need
      conversion to interfaces, if they are supers of value types.

      A future version of the Vector API may make use of enhanced generics,
      as noted above.

      It is expected that the API will incubate over multiple releases of the JDK and will adapt as
      dependent features such as value types become available in a future JDK release and
      newer CPU architectures become more established in the industry.

      API points for loop control, loop boundary processing, and active set
      maintenance are likely to be added or refined in a future version of this API.
      Additional vector shapes with intrinsic masks or lengths, or synthetic tandem
      vector types (vector pairs) may be introduced if they are found to help with
      loop management. Methods for alignment control may also be introduced, if they
      show benefits in portability or performance.

      Scatter and gather operations which can traverse managed heap pointers may be
      introduced in the future, if a portable and type-safe way can be found to
      express them (such as `VarHandle`s). This would allow workloads to be accessed
      directly in Java objects, instead of being buffered through Java arrays or byte
      buffers.

      Additional vector sizes and shapes may be supported in a future version of this
      API, in a follow-on JEP or perhaps during incubation. In principle the API could
      express additional vector shape properties besides bit-size, such as whether a
      vector is dense or not, whether it possesses an intrinsic mask, whether and how
      it may be dynamically sized, whether the size is a power of two, etc.

      A future version of this API may introduce additional, non-primitive lane types
      such as short floats (useful for machine learning) or very long integers (useful
      for cryptography), along with relevant specialized operations. Such types tend
      to be hardware-specific, and so a challenge of specifying such API points is
      either making them portable, or else properly scoping them to machine-specific
      instances of the JDK.

      Alternatives
      ------------

      HotSpot's auto-vectorization is an alternative approach but it would require
      significant enhancement and would likely still be fragile and limited compared
      to using the Vector API, since auto-vectorization with complex control flow is
      very hard to perform.

      In general, and even after decades of research (especially for FORTRAN and C
      array loops), it seems that auto-vectorization of scalar code is not a reliable
      tactic for optimizing ad hoc user-written loops, unless the user pays unusually
      careful attention to unwritten contracts about exactly which loops a compiler is
      prepared to auto-vectorize. It's too easy to write a loop that fails to
      auto-vectorize, for a reason that only the optimizer can detect, and not the
      human reader. Years of work on auto-vectorization (even in HotSpot) have left
      us with lots of optimization machinery that works only on special occasions. We
      want to enjoy the use of this machinery more often!


      Testing
      -------

      Combinatorial unit tests will be developed to ensure coverage for all
      operations, for all supported types and shapes, over various data sets. The
      tests will be implemented with TestNG and will be exercisable via `jtreg`.
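
      A minimal sketch of the intended combinatorial style, assuming TestNG and the
      API used elsewhere in this document (the class, test, and data-provider names
      are illustrative):

          import jdk.incubator.vector.FloatVector;
          import org.testng.Assert;
          import org.testng.annotations.DataProvider;
          import org.testng.annotations.Test;

          public class FloatVectorAddTest {

              @DataProvider
              public Object[][] sizes() {
                  // Exercise array lengths around the species length, including a tail.
                  int l = FloatVector.SPECIES_256.length();
                  return new Object[][] { {l}, {2 * l}, {2 * l + 3} };
              }

              @Test(dataProvider = "sizes")
              public void addMatchesScalar(int size) {
                  var species = FloatVector.SPECIES_256;
                  float[] a = new float[size], b = new float[size], r = new float[size];
                  for (int i = 0; i < size; i++) { a[i] = i; b[i] = 2 * i; }

                  for (int i = 0; i < species.loopBound(size); i += species.length()) {
                      FloatVector.fromArray(species, a, i)
                                 .add(FloatVector.fromArray(species, b, i))
                                 .intoArray(r, i);
                  }
                  for (int i = 0; i < species.loopBound(size); i++) {
                      Assert.assertEquals(r[i], a[i] + b[i], 0.0f); // vector result matches scalar result
                  }
              }
          }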

      Performance tests will be developed to ensure performance goals are met and
      vector computations map efficiently to vector hardware instructions. This will
      likely consist of JMH micro-benchmarks, but more realistic examples of useful
      algorithms will also be required.
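
      A sketch of the JMH shape such a micro-benchmark might take, assuming the
      `vectorComputation` and `scalarComputation` methods shown earlier are available
      to the benchmark class (names and problem sizes are illustrative):

          import java.util.concurrent.TimeUnit;
          import org.openjdk.jmh.annotations.*;

          @BenchmarkMode(Mode.AverageTime)
          @OutputTimeUnit(TimeUnit.NANOSECONDS)
          @State(Scope.Thread)
          public class VectorComputationBenchmark {

              @Param({"1024", "65536"})
              public int size;

              public float[] a, b, c;

              @Setup
              public void setup() {
                  a = new float[size]; b = new float[size]; c = new float[size];
                  for (int i = 0; i < size; i++) { a[i] = i; b[i] = size - i; }
              }

              @Benchmark
              public float[] scalar() { scalarComputation(a, b, c); return c; }

              @Benchmark
              public float[] vector() { vectorComputation(a, b, c); return c; }
          }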

      As a backup to performance tests, we will create white-box tests to
      force the JIT to report to us that vector API source code did, in
      fact, trigger vectorization.


      Risks and Assumptions
      ---------------------

      There is a risk that the API will be biased to the SIMD functionality supported
      on x64 architectures. This applies mainly to the explicitly fixed set of
      supported shapes, which biases against coding algorithms in a shape-generic
      fashion. We consider the majority of other operations of the Vector API to bias
      toward portable algorithms. To mitigate that risk, other architectures will be
      taken into account, specifically the ARM Scalable Vector Extension architecture,
      whose programming model adjusts dynamically to the single fixed shape
      supported by the hardware. We welcome and encourage OpenJDK contributors working
      on the ARM specific areas of HotSpot to participate in this effort.

      The Vector API uses box types (like `Integer`) as proxies for primitive types
      (like `int`). This decision is forced by the current limitations of Java
      generics (which are hostile to primitive types). When Project Valhalla
      eventually introduces more capable generics, the current decision will seem
      awkward, and may need changing. We assume that such changes will be possible
      without excessive backwards incompatibility.
