Summary
Improve the performance of AArch64 OpenJDK port intrinsics for operations that perform many loads and stores, such as the String and Array intrinsics.
Goals
- Compare against, and match, the performance that other architectures achieve for the same optimized operations.
Non-Goals
- Tune generic AArch64 port intrinsics for optimal performance on a single AArch64 architecture implementation only.
- Port intrinsics to the ARM CPU code branch.
Motivation
Specialized CPU architecture-specific code patterns improve the performance of user applications and benchmarks.
Description
Intrinsics are used to leverage CPU architecture-specific assembly code, which is executed instead of generic Java code for a given method to improve performance. While most of the intrinsics are already implemented in the AArch64 OpenJDK port, the current implementation of some intrinsics may not be optimal. Specifically, some intrinsics for AArch64 architectures may benefit from software prefetching instructions, memory address alignment, instruction placement tuned for multi-pipeline CPUs, replacement of certain instruction patterns with faster ones, or the use of SIMD instructions.
This includes (but is not limited to) such typical operations as String::compareTo, String::indexOf, StringCoding::hasNegatives, Arrays::equals, StringUTF16::compress, StringLatin1::inflate and checksum calculations.
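As a rough illustration of what such an intrinsic replaces, the sketch below shows a pure-Java analogue of a negative-byte check in two forms: a byte-at-a-time loop and a version that tests eight bytes per iteration. The class and method names are hypothetical and are not the JDK's internal code; the real intrinsics are hand-written AArch64 assembly, and this sketch only conveys the idea of doing more work per load.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: hypothetical names, not the JDK's implementation.
final class HasNegativesSketch {
    // Mask selecting the sign bit of each of the eight bytes in a 64-bit word.
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Reference version: one load and one compare per byte.
    static boolean hasNegativesScalar(byte[] a, int off, int len) {
        for (int i = off; i < off + len; i++) {
            if (a[i] < 0) {
                return true;
            }
        }
        return false;
    }

    // Wide version: test eight bytes per iteration, then finish the tail byte by byte.
    static boolean hasNegativesWide(byte[] a, int off, int len) {
        ByteBuffer buf = ByteBuffer.wrap(a);
        int i = off;
        int end = off + len;
        for (; i + Long.BYTES <= end; i += Long.BYTES) {
            // Byte order does not matter here: the mask covers every byte lane.
            if ((buf.getLong(i) & HIGH_BITS) != 0) {
                return true;
            }
        }
        for (; i < end; i++) {
            if (a[i] < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = new byte[37];          // all zero, so no negative bytes
        System.out.println(hasNegativesScalar(data, 0, data.length)
                + " " + hasNegativesWide(data, 0, data.length));
        data[36] = (byte) 0xC3;              // negative byte in the scalar tail
        System.out.println(hasNegativesScalar(data, 0, data.length)
                + " " + hasNegativesWide(data, 0, data.length));
    }
}
```

Both versions return the same answer for every input; the wide version simply issues fewer loads and branches, which is the same trade-off the assembly-level changes aim for with wider loads or SIMD registers.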
Depending on the intrinsic algorithm, the most common intrinsic use case, and CPU specifics, the following changes may be considered:
- Use the ARM NEON instruction set. Any such code will be guarded by a flag (such as UseSIMDForMemoryOps) wherever the existing algorithm also has a non-NEON version.
- Use the prefetch hint instruction (PRFM). The effect of this instruction depends on various factors, such as the presence and capabilities of a CPU hardware prefetcher, the CPU/memory clock ratio, memory controller specifics, and the needs of the particular algorithm.
- Reorder instructions and reduce data dependencies to allow out-of-order execution where possible.
- Avoid unaligned memory accesses where they are costly. Some CPU implementations incur penalties when a load or store crosses a 16-byte boundary or a dcache_line boundary, or have different optimal alignments for different load/store instructions (see, for example, the Cortex A53 guide). If aligned versions of the intrinsics do not slow down execution on alignment-independent CPUs, improving address alignment may help the CPUs that do incur penalties, provided it does not significantly increase code complexity.
Testing
- The performance of revised intrinsics will be tested on Cortex A53 and Cavium ThunderX hardware using JMH benchmarks and SPECjvm2005 where applicable.
- Functional correctness will be tested using the jtreg test suite. Additional tests may be created if the existing test base does not provide sufficient coverage.
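The kind of additional coverage that might be needed can be sketched as a sweep over lengths and start offsets, comparing a library method that HotSpot may intrinsify against a straightforward reference loop. The example below is a minimal plain-Java sketch, not a jtreg test; the class name, method names, and sweep bounds are assumptions chosen for illustration. It uses the range-based `Arrays.equals` overload available since Java 9.

```java
import java.util.Arrays;

// Illustrative sketch only: hypothetical names, not part of the jtreg test base.
final class ArraysEqualsSweep {
    // Reference: element-by-element comparison of two ranges of equal length.
    static boolean referenceEquals(byte[] a, int aFrom, byte[] b, int bFrom, int len) {
        for (int i = 0; i < len; i++) {
            if (a[aFrom + i] != b[bFrom + i]) {
                return false;
            }
        }
        return true;
    }

    // Compares Arrays.equals with the reference for every combination of
    // start offsets and lengths; returns the number of disagreements.
    static int sweep(int maxOffset, int maxLen) {
        int failures = 0;
        byte[] a = new byte[maxOffset + maxLen];
        byte[] b = new byte[maxOffset + maxLen];
        for (int aOff = 0; aOff < maxOffset; aOff++) {
            for (int bOff = 0; bOff < maxOffset; bOff++) {
                for (int len = 0; len <= maxLen; len++) {
                    // Give both ranges identical contents.
                    for (int i = 0; i < len; i++) {
                        byte v = (byte) (i * 31 + 7);
                        a[aOff + i] = v;
                        b[bOff + i] = v;
                    }
                    boolean lib = Arrays.equals(a, aOff, aOff + len, b, bOff, bOff + len);
                    if (lib != referenceEquals(a, aOff, b, bOff, len)) {
                        failures++;
                    }
                    if (len > 0) {
                        // Flip one bit in the last element so the ranges differ
                        // only in the tail, then restore it.
                        b[bOff + len - 1] ^= 1;
                        lib = Arrays.equals(a, aOff, aOff + len, b, bOff, bOff + len);
                        if (lib != referenceEquals(a, aOff, b, bOff, len)) {
                            failures++;
                        }
                        b[bOff + len - 1] ^= 1;
                    }
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + sweep(8, 64));
    }
}
```

Varying both start offsets independently exercises mixed alignments of the two operands, and placing the differing element at the very end of each range targets the tail-handling code that wide-load implementations typically need.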
Risks and Assumptions
- It is not possible to perform testing and measurements on all AArch64 hardware variants. When patches are submitted for review, we will rely on the OpenJDK community to test on hardware from vendors that we do not currently have in-house, should they find it necessary.
- Efforts will be made to improve the performance of a generic AArch64 port intrinsic implementation. In cases where this is not possible, specialized versions of intrinsics for a given hardware vendor may need to be written.
- The intrinsics in the scope of this JEP are CPU architecture-specific, and changing them does not affect shared HotSpot code.