Improve the existing string and array intrinsics, and implement new intrinsics for the java.lang.Math sin, cos and log functions, on AArch64 processors.
Non-Goals
- Compare to and match the performance of other architectures
- Tune generic AArch64 port intrinsics for optimal performance on a single ARM64 architecture implementation only
- Port intrinsics to the ARM CPU port
Specialized CPU architecture-specific code patterns improve the performance of user applications and benchmarks.
Intrinsics leverage CPU architecture-specific assembly code, which is executed instead of generic Java code for a given method, to improve performance. While most of the intrinsics are already implemented in the AArch64 port, optimized intrinsics for the following java.lang.Math methods are still missing:
- sin (sine trigonometric function)
- cos (cosine trigonometric function)
- log (logarithm of a number)
This JEP is intended to cover this gap by implementing optimized intrinsics for these methods.
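As a rough illustration of how the three methods could be checked once intrinsified, the sketch below compares Math.sin, Math.cos and Math.log against their StrictMath counterparts and reports the largest difference in ulps. The class name, method names and sample range are hypothetical and are not part of this JEP or of the JDK test suite.

```java
// Illustrative sketch only: MathIntrinsicCheck and its methods are hypothetical names.
final class MathIntrinsicCheck {

    /** Difference between 'actual' and 'expected', measured in ulps of 'expected'. */
    static double ulpError(double expected, double actual) {
        return Math.abs(actual - expected) / Math.ulp(expected);
    }

    /** Largest ulp difference of Math.sin/cos/log versus StrictMath over a sample range. */
    static double maxUlpError(int samples) {
        double worst = 0.0;
        for (int i = 1; i <= samples; i++) {
            double x = i * 0.01; // strictly positive inputs, so log is defined
            worst = Math.max(worst, ulpError(StrictMath.sin(x), Math.sin(x)));
            worst = Math.max(worst, ulpError(StrictMath.cos(x), Math.cos(x)));
            worst = Math.max(worst, ulpError(StrictMath.log(x), Math.log(x)));
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("max ulp difference vs StrictMath: " + maxUlpError(100_000));
    }
}
```

The java.lang.Math specification allows these methods an error of up to 1 ulp, so a small non-zero difference from StrictMath is not by itself a failure; a check of this kind would only flag results that drift beyond that tolerance.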
At the same time, while most of the intrinsics are already implemented in the AArch64 port, the current implementation of some intrinsics may not be optimal. Specifically, some intrinsics for AArch64 architectures may benefit from software prefetching instructions, memory address alignment, instruction placement for multi-pipeline CPUs, and the replacement of certain instruction patterns with faster ones or with SIMD instructions.
This includes (but is not limited to) such typical operations as StringLatin1::inflate and various checksum calculations.
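To make the kind of operation concrete, here is a minimal sketch of what an inflate routine does at the Java level: it widens Latin-1 bytes into UTF-16 chars one element at a time. This is an illustration of the generic, per-element form that an intrinsic would replace, not the JDK source; the class name and method signature are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: InflateSketch is a hypothetical class, not JDK code.
final class InflateSketch {

    /** Widens 'len' Latin-1 bytes from 'src' into UTF-16 chars in 'dst', one per iteration. */
    static void inflate(byte[] src, int srcOff, char[] dst, int dstOff, int len) {
        for (int i = 0; i < len; i++) {
            // Mask with 0xff so bytes >= 0x80 are not sign-extended into the char.
            dst[dstOff + i] = (char) (src[srcOff + i] & 0xff);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        char[] utf16 = new char[latin1.length];
        inflate(latin1, 0, utf16, 0, latin1.length);
        System.out.println(new String(utf16));
    }
}
```

A loop of this shape handles one byte per iteration, which is why it is a natural candidate for wider loads and stores or SIMD instructions in hand-written assembly.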
Depending on the intrinsic algorithm, the most common intrinsic use case, and CPU specifics, the following changes may be considered:
- Use the ARM NEON instruction set. Any such code will be guarded by a flag (such as UseSIMDForMemoryOps) when the existing algorithm has a non-NEON version.
- Use the prefetch-hint instruction (PRFM). The effect of this instruction depends on various factors such as the presence of a CPU hardware prefetcher and its capabilities, the CPU/memory clock ratio, memory controller specifics, and particular algorithm needs.
- Reorder instructions and reduce data dependencies to allow out-of-order execution where possible.
- Avoid unaligned memory access if needed. Some CPU implementations impose penalties when issuing load/store instructions across a 16-byte boundary, a dcache-line boundary, or have different optimal alignment for different load/store instructions (see, for example, the Cortex A53 guide). If the aligned versions of intrinsics do not slow down code execution on alignment-independent CPUs, it may be beneficial to improve address alignment to help those CPUs that do have some penalties, provided it does not significantly increase code complexity.
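The idea of reducing data dependencies can be sketched at the Java level, although the changes described here are made in hand-written assembly inside HotSpot. In the hypothetical example below, a byte sum is computed first with a single accumulator, where each addition depends on the previous one, and then with four independent accumulators that a multi-pipeline, out-of-order core is free to overlap. The class and method names are assumptions, and no particular speedup is implied.

```java
// Illustrative sketch only: AccumulatorSketch is a hypothetical class.
final class AccumulatorSketch {

    /** Single dependency chain: every addition waits for the previous result. */
    static int sumSerial(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum += b & 0xff;
        }
        return sum;
    }

    /** Four independent dependency chains, combined only at the end. */
    static int sumUnrolled(byte[] data) {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int limit = data.length & ~3; // largest multiple of 4 not exceeding the length
        int i = 0;
        for (; i < limit; i += 4) {
            s0 += data[i] & 0xff;
            s1 += data[i + 1] & 0xff;
            s2 += data[i + 2] & 0xff;
            s3 += data[i + 3] & 0xff;
        }
        for (; i < data.length; i++) {
            s0 += data[i] & 0xff; // remaining 0..3 tail elements
        }
        return s0 + s1 + s2 + s3;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1003];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i * 31);
        }
        System.out.println(sumSerial(data) == sumUnrolled(data));
    }
}
```

Because int addition is associative and commutative even when it wraps, both methods return the same value; whether the second form is actually faster depends on the CPU and on what the JIT compiler already does with the first.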
Testing
- Intrinsics performance will be tested on Cavium ThunderX, ThunderX2 and Cortex A53 hardware using JMH benchmarks.
- Functional correctness will be tested using the jtreg test suite. Additional tests may be created if the existing test base does not provide sufficient coverage.
Risks and Assumptions
- Efforts will be made to implement optimally-performant generic versions of the AArch64 intrinsics. In cases where this is not possible, specialized versions of the intrinsics for a given hardware vendor may need to be written.
- It is not possible to perform testing and performance measurements on all AArch64 hardware variants. We will rely on the OpenJDK Community to perform testing on hardware we currently do not have in-house should they find it necessary when patches are submitted for review.
- The intrinsics in scope for this JEP are CPU architecture-specific, so changing them does not affect shared HotSpot code.