-
Enhancement
-
Resolution: Unresolved
-
P3
-
21
-
generic
-
generic
Post loop vectorization uses the vector mask (predicate) features of some hardware platforms, such as x86 AVX-512 and AArch64 SVE, to vectorize the tail iterations of loops for better performance. The existing implementation in the C2 compiler has a long history. It was first implemented in JDK-8153998 in 2016 under C2's experimental feature PostLoopMultiversioning to support x86 AVX-512 vector masks. Due to insufficient maintenance, it was broken for a very long time. Last year, we took over JDK-8183390 to fix and re-enable this feature. Several issues were fixed and AArch64 vector mask support was added at that time. Because we proposed making post loop vectorization non-experimental in a future JDK release, we ran stress tests early this year and found more problems. They fall into three areas: stability, maintainability and performance.
1. Stability
Multiple C2 crashes and miscompilations related to post loop vectorization have been filed on JBS, including JDK-8301657, JDK-8301904, JDK-8301944, JDK-8304774, JDK-8308949 and perhaps more with recent C2 patches.
2. Maintainability
The original implementation is based on multi-versioned post loops, and its code is mixed into SuperWord. However, post loop vectorization does not actually use the SLP algorithm, so the current SuperWord code contains a lot of special handling for post loops. As more and more features are added to SuperWord, this legacy code becomes increasingly difficult to maintain and extend.
3. Performance
Post loop vectorization was expected to bring a clear performance benefit for loops with small iteration counts, but JMH tests showed that it did not. A main reason is that the main loop's minimum-trip guard jumps over the multi-versioned vector post loop when the whole loop has very few iterations (see JDK-8307084 for details). The previous implementation also has limited vectorization ability; for example, it can only vectorize loop statements with a single data size.
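To make the "single data size" limitation concrete, here is a minimal sketch of two loop shapes. The class and method names are hypothetical and chosen only for illustration. Whether C2 vectorizes either loop depends on the hardware, JVM flags and JDK build, so this only shows what the Java source of each shape looks like and makes no claim about the generated code.

```java
import java.util.Arrays;

// Hypothetical example class; not part of the JDK or of the proposed patch.
class PostLoopShapes {

    // Single data size: every array access in the loop body is a 4-byte int.
    static void addInts(int[] a, int[] b, int[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Mixed data sizes: 2-byte short loads feed 4-byte int stores,
    // so one loop statement touches elements of two different sizes.
    static void widen(short[] src, int[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = src[i] + 1;
        }
    }

    public static void main(String[] args) {
        int n = 7; // a small trip count, the case this issue is concerned with
        int[] a = new int[n];
        int[] b = new int[n];
        short[] s = new short[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 10 * i;
            s[i] = (short) i;
        }
        int[] sum = new int[n];
        int[] wide = new int[n];
        addInts(a, b, sum, n);
        widen(s, wide, n);
        System.out.println(Arrays.toString(sum));
        System.out.println(Arrays.toString(wide));
    }
}
```

Loops of the second shape are the kind that the description above says the previous implementation could not handle in the post loop.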
For better stability, maintainability and performance, we now propose to deprecate the current multi-versioning framework and completely re-implement the experimental post loop vectorization for both x86 AVX-512 and AArch64 SVE. Our new proposal is to add a standalone ideal loop phase (outside SuperWord) that performs the vector mask transformation directly on the original scalar post loop.
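The idea of masking the tail iterations can be modelled in plain Java. The sketch below is an illustration under stated assumptions, not the proposed C2 phase: C2 works on its ideal graph, not on Java source, and the names `MaskedPostLoopSketch`, `LANES` and `tailMask` are hypothetical. It assumes a lane count of 8 and models the vector mask as a `boolean[]`.

```java
// Hypothetical model of a masked post loop; not C2 code.
class MaskedPostLoopSketch {

    // Assumed number of int lanes per vector (e.g. 8 for a 256-bit vector).
    static final int LANES = 8;

    // Models a vector mask for the tail: lane k is active iff i + k < n.
    static boolean[] tailMask(int i, int n) {
        boolean[] mask = new boolean[LANES];
        for (int k = 0; k < LANES; k++) {
            mask[k] = i + k < n;
        }
        return mask;
    }

    static void add(int[] a, int[] b, int[] out, int n) {
        int i = 0;
        // Main loop: processes full groups of LANES elements.
        for (; i + LANES <= n; i += LANES) {
            for (int k = 0; k < LANES; k++) {
                out[i + k] = a[i + k] + b[i + k];
            }
        }
        // Post loop: the remaining n - i elements (fewer than LANES) are
        // handled in one masked step; inactive lanes touch no memory.
        if (i < n) {
            boolean[] mask = tailMask(i, n);
            for (int k = 0; k < LANES; k++) {
                if (mask[k]) {
                    out[i + k] = a[i + k] + b[i + k];
                }
            }
        }
    }

    public static void main(String[] args) {
        int n = 11; // one full group of 8 plus a 3-element tail
        int[] a = new int[n];
        int[] b = new int[n];
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2 * i;
        }
        add(a, b, out, n);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

In this model a loop with fewer than `LANES` iterations skips the main loop entirely and is still covered by the masked step, which is the small-trip-count case the performance discussion is about.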
The patch for this is expected to target JDK 22.
- relates to
  - JDK-8183390 Fix and re-enable post loop vectorization (Resolved)
  - JDK-8344085 C2 SuperWord: improve vectorization for small loop iteration count (Open)
  - JDK-8153998 Masked vector post loops (Resolved)
  - JDK-8311691 C2: Remove legacy code related to PostLoopMultiversioning (Resolved)
  - JDK-8312332 C2: Refactor SWPointer out from SuperWord (Resolved)
  - JDK-8315361 C2 SuperWord: refactor out loop analysis into shared auto-vectorization facility VLoopAnalyzer (Closed)
- links to
  - Review openjdk/jdk/14581