Type: Enhancement
Resolution: Unresolved
Priority: P3
Fix Version/s: tbd
Affects Version/s: 21
Component/s: hotspot
Labels:

Subcomponent: compiler
CPU: generic
OS: generic

Post loop vectorization takes advantage of the vector mask (predicate) features of some hardware platforms, such as x86 AVX-512 and AArch64 SVE, to vectorize the tail iterations of loops for better performance. The existing implementation in the C2 compiler has a long history. It was first implemented in ~~JDK-8153998~~ in 2016 under C2's experimental feature PostLoopMultiversioning to support x86 AVX-512 vector masks. Due to insufficient maintenance, it had been broken for a very long time. Last year, we took over ~~JDK-8183390~~ to fix and re-enable this feature. Several issues were fixed and AArch64 vector mask support was added at that time. Because we proposed making post loop vectorization non-experimental in future JDK releases, we ran some stress tests earlier this year and found more problems. They fall into three areas: stability, maintainability and performance.

1. Stability
Multiple C2 crash or mis-compilation issues related to post loop vectorization were filed on JBS, including ~~JDK-8301657~~, ~~JDK-8301904~~, ~~JDK-8301944~~, ~~JDK-8304774~~, ~~JDK-8308949~~ and perhaps more with recent C2 patches.

2. Maintainability
The original implementation is based on multi-versioned post loops, and its code is mixed into SuperWord. But post loop vectorization does not actually use the SLP algorithm, so the current SuperWord code contains a lot of special handling for post loops. As more and more features are added to SuperWord, the legacy code is becoming increasingly difficult to maintain and extend.

3. Performance
Post loop vectorization was expected to bring a clear performance benefit for loops with small iteration counts, but JMH tests showed that it did not. A main reason is that the multi-versioned vector post loop is skipped by the main loop's minimum-trip guard when the whole loop has very few iterations (see JDK-8307084 for details). The previous implementation also has limited vectorization ability; for example, it can only vectorize loop statements with a single data size.
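To make the guard problem concrete, here is a deliberately simplified model of the loop shape described above. It is not C2 code: the class name, `LANES`, and `MIN_TRIP` are hypothetical, and the real guard threshold depends on unrolling and vector length. The sketch only counts how many iterations go through each path.

```java
// Simplified, hypothetical model of the loop shape; not actual C2 code.
final class MinTripGuardSketch {
    static final int LANES = 16;    // assumed number of int lanes per vector
    static final int MIN_TRIP = 64; // assumed minimum-trip threshold (hypothetical value)

    // Returns {mainIterations, maskedPostIterations, scalarIterations} for trip count n.
    static int[] countPaths(int n) {
        int main = 0, masked = 0, scalar = 0;
        int i = 0;
        if (n >= MIN_TRIP) {
            // Main vector loop: full vectors only.
            for (; i + LANES <= n; i += LANES) {
                main += LANES;
            }
            // Masked vector post loop: only reachable after the main loop.
            if (i < n) {
                masked += n - i;
                i = n;
            }
        }
        // Scalar post loop: runs whatever is left, including the whole
        // loop when the guard above was not taken.
        for (; i < n; i++) {
            scalar++;
        }
        return new int[] {main, masked, scalar};
    }
}
```

In this model, `countPaths(10)` sends all 10 iterations to the scalar loop, while `countPaths(70)` runs 64 iterations in the main loop and 6 in the masked post loop. Short loops, the case that was supposed to benefit most, never reach the masked path.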

For better stability, maintainability and performance, we now propose to deprecate the current multi-versioning framework and completely re-implement the experimental post loop vectorization, for both x86 AVX-512 and AArch64 SVE. Our new proposal is to add a standalone ideal loop phase (outside SuperWord) to do vector mask transformation directly on the original scalar post loop.
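As an illustration of the general idea of a masked tail iteration (not of the proposed C2 phase itself), the following plain-Java sketch emulates vector lanes with an inner loop and a boolean mask. The class name, `LANES`, and `tailMask` are assumptions made for this example; on real hardware the mask would be an AVX-512 opmask or SVE predicate register and the lane loop a single vector instruction.

```java
// Hypothetical emulation of a masked vector post loop; not actual C2 output.
final class MaskedTailSketch {
    static final int LANES = 16; // assumed number of int lanes per vector

    // Computes c[i] = a[i] + b[i] for i in [0, n).
    static void add(int[] a, int[] b, int[] c, int n) {
        int i = 0;
        // Main loop: full vectors, every lane active.
        for (; i + LANES <= n; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        // Masked post loop: executes at most once here, because fewer
        // than LANES elements remain after the main loop.
        for (; i < n; i += LANES) {
            boolean[] mask = tailMask(i, n);
            for (int lane = 0; lane < LANES; lane++) {
                if (mask[lane]) { // lanes at or beyond n are disabled
                    c[i + lane] = a[i + lane] + b[i + lane];
                }
            }
        }
    }

    // Lane k is active iff element i + k is still inside the iteration space.
    static boolean[] tailMask(int i, int n) {
        boolean[] mask = new boolean[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            mask[lane] = i + lane < n;
        }
        return mask;
    }
}
```

With `n = 19`, the main loop covers elements 0 to 15 and a single masked step covers 16 to 18 with three active lanes; element 19 and beyond are never read or written.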

A patch for this is expected to target JDK 22.

relates to

JDK-8183390 Fix and re-enable post loop vectorization (Resolved)

JDK-8344085 C2 SuperWord: improve vectorization for small loop iteration count (Open)

JDK-8153998 Masked vector post loops (Resolved)

JDK-8311691 C2: Remove legacy code related to PostLoopMultiversioning (Resolved)

JDK-8312332 C2: Refactor SWPointer out from SuperWord (Resolved)

JDK-8315361 C2 SuperWord: refactor out loop analysis into shared auto-vectorization facility VLoopAnalyzer (Closed)

links to

Review openjdk/jdk/14581


Assignee: Fei Gao

Reporter: Pengfei Li

Votes: 0

Watchers: 6

Created: 2023-05-28 19:27

Updated: 2024-11-12 23:05
