JDK-8308994

C2: Re-implement experimental post loop vectorization

      Post loop vectorization takes advantage of the vector mask (predicate) features of some hardware platforms, such as x86 AVX-512 and AArch64 SVE, to vectorize the tail iterations of loops for better performance. The existing implementation in the C2 compiler has a long history. It was first implemented in JDK-8153998 in 2016, under C2's experimental feature PostLoopMultiversioning, to support x86 AVX-512 vector masks. Due to insufficient maintenance, it had been broken for a very long time. Last year, we took over JDK-8183390 to fix and re-enable this feature. Several issues were fixed and AArch64 vector mask support was added at that time. As we proposed to make post loop vectorization non-experimental in future JDK releases, we ran some stress tests early this year and found more problems. These problems concern stability, maintainability and performance.

      1. Stability
      Multiple C2 crashes and miscompilation issues related to post loop vectorization have been filed on JBS, including JDK-8301657, JDK-8301904, JDK-8301944, JDK-8304774 and JDK-8308949, and perhaps more will appear with recent C2 patches.

      2. Maintainability
      The original implementation is based on multi-versioned post loops, and its code is mixed into SuperWord. But post loop vectorization does not actually use the SLP algorithm, so the current SuperWord code contains a lot of special handling for post loops. As more and more features are added to SuperWord, this legacy code is becoming increasingly difficult to maintain and extend.

      3. Performance
      Post loop vectorization was expected to bring a clear performance benefit for loops with small iteration counts, but JMH tests showed that it did not. A main reason is that the main loop's minimum-trip guard jumps over the multi-versioned vector post loop when the whole loop has very few iterations (see JDK-8307084 for details). The previous implementation also has limited vectorization ability; for example, it can only vectorize loop statements that all operate on a single data size.
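
The guard problem described above can be illustrated with a small model in plain Java. This is only a sketch under stated assumptions, not C2's actual loop structure: `LANES`, `MAIN_STRIDE`, the class name and the `split` method are hypothetical, and the model assumes the vector post loop is reachable only after the main loop's minimum-trip guard has passed.

```java
import java.util.Arrays;

class MinTripGuardSketch {
    // Hypothetical vector width, in elements.
    static final int LANES = 16;
    // Hypothetical number of elements consumed per main-loop iteration after unrolling.
    static final int MAIN_STRIDE = 64;

    /** Returns {mainElems, vectorPostElems, scalarPostElems} for a loop over n elements. */
    static int[] split(int n) {
        int i = 0;
        int main = 0;
        int vectorPost = 0;
        if (n >= MAIN_STRIDE) {                // modelled minimum-trip guard of the main loop
            while (i + MAIN_STRIDE <= n) {     // vectorized, unrolled main loop
                i += MAIN_STRIDE;
            }
            main = i;
            while (i < n) {                    // masked vector post loop (modelled)
                int active = Math.min(LANES, n - i);
                vectorPost += active;
                i += active;
            }
        }
        int scalarPost = n - i;                // scalar post loop takes whatever is left
        return new int[] { main, vectorPost, scalarPost };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(10)));
        System.out.println(Arrays.toString(split(70)));
    }
}
```

In this model, `split(10)` assigns all 10 elements to the scalar post loop, because the guard skips the main loop and the vector post loop together, while `split(70)` assigns 64 elements to the main loop and 6 to the vector post loop. Short loops, which were expected to benefit most, never reach the masked code.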

      For better stability, maintainability and performance, we now propose to deprecate the current multi-versioning framework and completely re-implement the experimental post loop vectorization for both x86 AVX-512 and AArch64 SVE. Our new proposal is to add a standalone ideal loop phase (outside SuperWord) that performs the vector mask transformation directly on the original scalar post loop.
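
To make the idea of masked tail processing concrete, here is a scalar emulation in plain Java. It is a minimal sketch of what a vector mask does to the tail iterations, not the proposed C2 phase and not real vector code: `LANES`, `maskedAdd`, `add` and the class name are hypothetical, and each "vector operation" is simulated by a lane-by-lane loop.

```java
import java.util.Arrays;

class MaskedPostLoopSketch {
    // Hypothetical vector width in int lanes (for example, 512 bits / 32 bits).
    static final int LANES = 16;

    // Emulates one masked vector add: lane l is computed only if mask[l] is set,
    // so inactive lanes never touch memory beyond the end of the arrays.
    static void maskedAdd(int[] a, int[] b, int[] c, int base, boolean[] mask) {
        for (int l = 0; l < LANES; l++) {
            if (mask[l]) {
                c[base + l] = a[base + l] + b[base + l];
            }
        }
    }

    // Computes c[i] = a[i] + b[i] and returns the number of emulated vector operations.
    static int add(int[] a, int[] b, int[] c) {
        int n = c.length;
        int i = 0;
        int vectorOps = 0;
        boolean[] mask = new boolean[LANES];

        // Main loop: full vectors, all lanes active.
        Arrays.fill(mask, true);
        for (; i + LANES <= n; i += LANES) {
            maskedAdd(a, b, c, i, mask);
            vectorOps++;
        }

        // Post loop: the remaining n - i (< LANES) elements are handled by a
        // single operation whose mask enables only the in-range lanes.
        if (i < n) {
            for (int l = 0; l < LANES; l++) {
                mask[l] = i + l < n;
            }
            maskedAdd(a, b, c, i, mask);
            vectorOps++;
        }
        return vectorOps;
    }

    public static void main(String[] args) {
        int[] a = new int[37];
        int[] b = new int[37];
        int[] c = new int[37];
        Arrays.fill(a, 1);
        Arrays.fill(b, 2);
        System.out.println(add(a, b, c) + " emulated vector operations");
    }
}
```

With 37 elements and 16 lanes, the main loop covers 32 elements in two operations and the 5 tail elements take one masked operation, instead of five scalar iterations.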

      The patch for this is expected to target JDK 22.

            fgao Fei Gao
            pli Pengfei Li