Summary
Improve the existing string and array intrinsics, and implement new intrinsics for the java.lang.Math sin, cos and log functions, on AArch64 processors.
Non-Goals
- Compare to and match the performance of other architectures
- Tune generic AArch64 port intrinsics for optimal performance on a single ARM64 architecture implementation only
- Port intrinsics to the ARM CPU port
Motivation
Specialized CPU architecture-specific code patterns improve the performance of user applications and benchmarks.
Description
Intrinsics replace the generic Java code of a given method with CPU architecture-specific assembly code in order to improve performance. While most intrinsics are already implemented in the AArch64 port, optimized intrinsics for the following java.lang.Math methods are still missing:
- sin (sine trigonometric function)
- cos (cosine trigonometric function)
- log (logarithm of a number)
This JEP is intended to cover this gap by implementing optimized intrinsics for these methods.
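From the caller's side nothing changes when a method is intrinsified: the same Math.sin, Math.cos and Math.log calls are made, and the VM decides whether to substitute the optimized code. The following is a minimal sketch, not part of the JEP, that compares these methods against their StrictMath counterparts over a sample range. The class name MathIntrinsicCheck and the helper maxAbsDiff are hypothetical names chosen for illustration. Whether an intrinsic is actually used depends on the platform and JVM in use.

```java
import java.util.function.DoubleUnaryOperator;

public class MathIntrinsicCheck {

    // Hypothetical helper: largest absolute difference between two
    // implementations of the same function over evenly spaced inputs.
    static double maxAbsDiff(DoubleUnaryOperator a, DoubleUnaryOperator b,
                             double from, double to, int steps) {
        double worst = 0.0;
        for (int i = 0; i <= steps; i++) {
            double x = from + (to - from) * i / steps;
            double diff = Math.abs(a.applyAsDouble(x) - b.applyAsDouble(x));
            worst = Math.max(worst, diff);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Math.* may be replaced by an intrinsic; StrictMath.* is the
        // portable reference implementation.
        System.out.println("sin: " + maxAbsDiff(Math::sin, StrictMath::sin, -10.0, 10.0, 10_000));
        System.out.println("cos: " + maxAbsDiff(Math::cos, StrictMath::cos, -10.0, 10.0, 10_000));
        System.out.println("log: " + maxAbsDiff(Math::log, StrictMath::log, 0.5, 100.0, 10_000));
    }
}
```

A check of this kind only confirms that results stay within the accuracy the java.lang.Math specification allows. It says nothing about speed, which needs a proper benchmark harness.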
At the same time, while most of the intrinsics are already implemented in the AArch64 port, the current implementation of some intrinsics may not be optimal. Specifically, some intrinsics for AArch64 architectures may benefit from software prefetching instructions, memory address alignment, instruction placement for multi-pipeline CPUs, and the replacement of certain instruction patterns with faster ones or with SIMD instructions.
This includes (but is not limited to) such typical operations as String::compareTo, String::indexOf, StringCoding::hasNegatives, Arrays::equals, StringUTF16::compress, StringLatin1::inflate, and various checksum calculations.
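To make the list above concrete, the sketch below shows ordinary Java code that goes through several of these operations using public APIs only. It is an illustration and not part of the JEP: the class name IntrinsicCandidates and its methods are hypothetical. StringCoding::hasNegatives, StringUTF16::compress and StringLatin1::inflate are JDK-internal helpers and are not called directly here. Whether any of these calls is actually replaced by an intrinsic depends on the JVM, the JIT compiler tier and the hardware.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;

public class IntrinsicCandidates {

    // String::compareTo, one of the operations named in the text.
    static int compare(String a, String b) {
        return a.compareTo(b);
    }

    // String::indexOf with a String pattern.
    static int find(String text, String pattern) {
        return text.indexOf(pattern);
    }

    // Arrays::equals on byte arrays.
    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    // Checksum calculations via java.util.zip.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    static long crc32c(byte[] data) {
        CRC32C crc = new CRC32C(); // available since Java 9
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.println(compare("apple", "apricot"));
        System.out.println(find("hello world", "world"));
        System.out.println(sameBytes(data, data.clone()));
        System.out.println(Long.toHexString(crc32(data)));
        System.out.println(Long.toHexString(crc32c(data)));
    }
}
```

Because the results are defined by the Java API and not by the intrinsic, an optimized implementation must return exactly the same values as the generic code.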
Depending on the intrinsic algorithm, the most common intrinsic use case, and CPU specifics, the following changes may be considered:
- Use the ARM NEON instruction set. Such code (if any is created) will be placed under a flag (such as UseSIMDForMemoryOps) if the existing algorithm has a non-NEON version.
- Use the prefetch-hint instruction (PRFM). The effect of this instruction depends on various factors such as the presence of a CPU hardware prefetcher and its capabilities, the CPU/memory clock ratio, memory controller specifics, and particular algorithm needs.
- Reorder instructions and reduce data dependencies to allow out-of-order execution where possible.
- Avoid unaligned memory access if needed. Some CPU implementations impose penalties when issuing load/store instructions across a 16-byte boundary, a dcache-line boundary, or have different optimal alignment for different load/store instructions (see, for example, the Cortex A53 guide). If the aligned versions of intrinsics do not slow down code execution on alignment-independent CPUs, it may be beneficial to improve address alignment to help those CPUs that do have some penalties, provided it does not significantly increase code complexity.
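The alignment point in the last item can be illustrated with a little address arithmetic. The sketch below is an assumption-laden illustration and not the HotSpot implementation, which is written in AArch64 assembly. It splits a memory range into an unaligned head, a body made of whole aligned blocks, and a tail, which is the usual shape of a routine that wants its main loop to issue only aligned accesses. The class AlignmentSplit and its method names are hypothetical, and the alignment is assumed to be a power of two.

```java
public class AlignmentSplit {

    // Bytes to handle before the address reaches the next multiple of
    // 'alignment' (assumed to be a power of two), capped at 'length'.
    static int headBytes(long address, int alignment, int length) {
        int toBoundary = (int) (-address & (alignment - 1));
        return Math.min(toBoundary, length);
    }

    // Bytes that can be handled in whole aligned blocks after the head.
    static int bodyBytes(long address, int alignment, int length) {
        int remaining = length - headBytes(address, alignment, length);
        return remaining - (remaining % alignment);
    }

    // Bytes left over after the head and the aligned body.
    static int tailBytes(long address, int alignment, int length) {
        return length
                - headBytes(address, alignment, length)
                - bodyBytes(address, alignment, length);
    }

    public static void main(String[] args) {
        long address = 0x1003L; // 3 bytes past a 16-byte boundary
        int length = 100;
        System.out.println("head = " + headBytes(address, 16, length)); // 13
        System.out.println("body = " + bodyBytes(address, 16, length)); // 80
        System.out.println("tail = " + tailBytes(address, 16, length)); // 7
    }
}
```

The extra head and tail handling is the code-complexity cost the text mentions. It pays off only on CPUs that penalize accesses crossing the boundary, and only when the range is long enough for the aligned body to dominate.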
Testing
- Intrinsics performance will be tested on Cavium ThunderX, ThunderX2 and Cortex A53 hardware using JMH benchmarks.
- Functional correctness will be tested using the jtreg test suite. Additional tests might be created if the existing test base does not provide sufficient coverage.
Risks and Assumptions
- Efforts will be made to implement optimally-performant generic versions of the AArch64 intrinsics. In cases where this is not possible, specialized versions of the intrinsics for a given hardware vendor may need to be written.
- It is not possible to perform testing and performance measurements on all AArch64 hardware variants. We will rely on the OpenJDK Community to perform testing on hardware we currently do not have in-house should they find it necessary when patches are submitted for review.
- The intrinsics in scope for this JEP are CPU architecture-specific, so changing them does not affect shared HotSpot code.
Issue Links
- is blocked by
  - JDK-8184943 AARCH64: Intrinsify hasNegatives (Resolved)
  - JDK-8187472 AARCH64: array_equals intrinsic doesn't use prefetch for large arrays (Resolved)
  - JDK-8189103 AARCH64: optimize String indexOf intrinsic (Resolved)
  - JDK-8189112 AARCH64: optimize StringUTF16 compress intrinsic (Resolved)
  - JDK-8189113 AARCH64: StringLatin1 inflate intrinsic doesn't use prefetch instruction (Resolved)
  - JDK-8202326 AARCH64: optimize string compare intrinsic (Resolved)
  - JDK-8189176 AARCH64: Improve _updateBytesCRC32 intrinsic (Resolved)
  - JDK-8189177 AARCH64: Improve _updateBytesCRC32C intrinsic (Resolved)
  - JDK-8189745 AARCH64: Use CRC32C intrinsic code in interpreter and C1 (Resolved)
- relates to
  - JDK-8189106 AARCH64: create intrinsic for tan (Open)
  - JDK-8189107 AARCH64: create intrinsic for pow (Open)
  - JDK-8307332 AARCH64: create intrinsic for exp (Open)
  - JDK-8193806 AARCH64: create intrinsic for vectorized mismatch (Closed)
  - JDK-8189105 AARCH64: create intrinsic for sin and cos (Resolved)
  - JDK-8196402 AARCH64: create intrinsic for Math.log (Resolved)