-
Enhancement
-
Resolution: Unresolved
-
P4
-
11, 17, 20, 21
In C2's loop optimization, a counted loop can be split into pre-, main-, and post-loops. C2 also inserts minimum-trip guards (a.k.a. zero-trip guards) before the main loop and the post loop. These guards test the remaining trip count of the loop: execution jumps over the loop body if the remaining trip count is less than the loop stride (after unrolling), so that the loop does not overrun. For example, if a main loop is unrolled 8x (and vectorized), the main loop guard tests whether fewer than 8 iterations remain, as shown in figure (a) below.
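The loop shape described above can be approximated by hand in Java. This is only an illustrative sketch, not C2's actual output: the class name `LoopSplitSketch`, the method `sum`, and the `preIters` parameter are hypothetical, and the inner `for` loop merely stands in for one unrolled (and vectorized) main-loop body.

```java
public class LoopSplitSketch {
    // Main-loop stride after 8x unrolling, as in the example above.
    static final int UNROLL = 8;

    // Sums a[0..n) using a hand-written pre-/main-/post-loop split.
    // preIters is a hypothetical number of pre-loop iterations.
    static long sum(int[] a, int preIters) {
        int n = a.length;
        long s = 0;
        int i = 0;

        // Pre-loop: runs a few scalar iterations first.
        int preLimit = Math.min(preIters, n);
        for (; i < preLimit; i++) {
            s += a[i];
        }

        // Main-loop zero-trip guard: skip the main loop entirely
        // if fewer than UNROLL iterations remain.
        if (n - i >= UNROLL) {
            do {
                // Stands in for one unrolled/vectorized loop body.
                for (int k = 0; k < UNROLL; k++) {
                    s += a[i + k];
                }
                i += UNROLL;
            } while (n - i >= UNROLL);
        }

        // Post-loop zero-trip guard: skip if nothing is left.
        if (i < n) {
            do {
                s += a[i];
                i++;
            } while (i < n);
        }
        return s;
    }
}
```

Each guard is the `if` in front of a `do`/`while` loop: it ensures the loop body, which always executes at least once when entered, never runs with fewer remaining iterations than its stride.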
Usually, the vectorized main loop is super-unrolled after vectorization, which multiplies the main loop's stride further. To keep the scalar post loop from running too many iterations after super-unrolling, C2 clones the main loop before super-unrolling to create a vector drain loop (a.k.a. atomic post loop). The newly inserted post loop also has a min-trip guard. The trip guards of both the main loop and the vector post loop jump to the scalar post loop, as shown in figure (b) below.
After the main loop is super-unrolled, the test in the main loop's trip guard is updated. Supposing the super-unrolling count is 4 in this example, the trip guard tests whether the remaining trip count is less than 8 * 4 = 32, as shown in figure (c) below.
The problem is that if the iteration count of a loop is relatively small but larger than the vector length, the vector atomic post loop is never executed: the main loop's trip guard finds too few iterations remaining and jumps over the atomic post loop as well. For example, in the case above, if a loop still has 25 iterations left after the pre-loop has executed, the atomic post loop could run 3 trips (24 iterations), but it is never reached. It would be better if the main loop's trip guard did not jump over the atomic post loop.
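The effect can be checked with a small trip-count model. This is a simplified sketch under stated assumptions, not C2 code: the class `TripGuardSketch`, the method `trips`, and the flag `mainGuardSkipsDrain` are hypothetical names, and the strides (8 for the vector drain loop, 8 * 4 = 32 for the super-unrolled main loop) are taken from the example above.

```java
import java.util.Arrays;

public class TripGuardSketch {
    static final int VECTOR_STRIDE = 8;                          // drain-loop stride
    static final int SUPER_UNROLL = 4;                           // super-unrolling count
    static final int MAIN_STRIDE = VECTOR_STRIDE * SUPER_UNROLL; // 32

    // Returns {main-loop trips, drain-loop trips, scalar post-loop trips}
    // for the given number of iterations remaining after the pre-loop.
    // mainGuardSkipsDrain == true models the behavior described in this issue;
    // false models the suggested improvement.
    static int[] trips(int remaining, boolean mainGuardSkipsDrain) {
        int main = 0;
        int drain = 0;

        // Main-loop trip guard: enter only if a full main stride remains.
        boolean enteredMain = remaining >= MAIN_STRIDE;
        while (remaining >= MAIN_STRIDE) {
            remaining -= MAIN_STRIDE;
            main++;
        }

        // The drain loop is reachable after the main loop has run, or
        // when the main guard falls through to it instead of skipping it.
        if (enteredMain || !mainGuardSkipsDrain) {
            // Drain-loop trip guard: enter only if a full vector stride remains.
            while (remaining >= VECTOR_STRIDE) {
                remaining -= VECTOR_STRIDE;
                drain++;
            }
        }

        // Whatever is left runs in the scalar post loop.
        return new int[] {main, drain, remaining};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(trips(25, true)));  // prints [0, 0, 25]
        System.out.println(Arrays.toString(trips(25, false))); // prints [0, 3, 1]
    }
}
```

In this model, 25 remaining iterations all run in the scalar post loop when the main guard skips the drain loop, whereas letting the guard fall through to the drain loop leaves a single scalar iteration.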
This issue does not cause any incorrect behavior, but fixing it can improve the performance of loops with small trip counts.
- relates to
-
JDK-8344085 C2 SuperWord: improve vectorization for small loop iteration count
- Open
-
JDK-8149421 Vectorized Post Loops
- Resolved
-
JDK-8151573 Multiversioning for range check elimination
- Resolved