Enhancement
Resolution: Unresolved
P4
24
SuperWord is relatively well suited for high iteration counts, where most of the time is spent in the super-unrolled main-loop. This gives us good throughput.
But for small iteration counts, there are some issues:
- We align the loop in the pre-loop. This can take up to vector-length (the number of elements in a vector) iterations.
- Then, at the zero-trip guard of the main-loop, we check whether enough iterations are left to enter the super-unrolled main-loop. This requires at least super-unroll-factor * vector-length remaining iterations. If there are fewer, we go directly to the post-loop and never enter vectorized code at all.
- This is why we filed JDK-8307084: if there are not enough iterations for the super-unrolled main-loop, we should at least enter the vectorized post-loop. It has lower throughput than the main-loop, but it is useful not only for cleanup after the main-loop but also for small iteration counts where the main-loop cannot be used.
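To make the accounting above concrete, here is a small sketch of how a trip count might be split across the loops. It is a simplified model written for illustration, not C2's actual logic: the class `LoopSplitModel`, its parameters, and the `drainWithoutMain` switch (standing in for the behavior requested in JDK-8307084) are all hypothetical names.

```java
// Simplified model of the pre / main / post loop split. NOT C2's actual logic;
// all names and parameters here are hypothetical and chosen for illustration.
final class LoopSplitModel {

    /** Iterations executed in each loop of the modeled structure. */
    static final class Split {
        final int pre, main, vectorPost, scalarPost;

        Split(int pre, int main, int vectorPost, int scalarPost) {
            this.pre = pre;
            this.main = main;
            this.vectorPost = vectorPost;
            this.scalarPost = scalarPost;
        }

        /** Iterations that run in vectorized code. */
        int vectorized() {
            return main + vectorPost;
        }
    }

    /**
     * @param tripCount        total number of scalar iterations
     * @param alignIters       iterations the pre-loop needs for alignment (assumed given)
     * @param vectorLength     number of elements per vector
     * @param superUnroll      super-unroll factor of the main-loop
     * @param drainWithoutMain if true, the vectorized post-loop may run even when
     *                         the main-loop is skipped (the idea behind JDK-8307084)
     */
    static Split split(int tripCount, int alignIters, int vectorLength,
                       int superUnroll, boolean drainWithoutMain) {
        int pre = Math.min(tripCount, alignIters);
        int remaining = tripCount - pre;

        // Zero-trip guard: the main-loop needs a full super-unrolled stride.
        int mainStride = superUnroll * vectorLength;
        int main = (remaining / mainStride) * mainStride;
        remaining -= main;

        // Modeled behavior: without the main-loop, vectorized code is skipped
        // unless drainWithoutMain is set.
        boolean mayDrain = main > 0 || drainWithoutMain;
        int vectorPost = mayDrain ? (remaining / vectorLength) * vectorLength : 0;
        int scalarPost = remaining - vectorPost;
        return new Split(pre, main, vectorPost, scalarPost);
    }

    public static void main(String[] args) {
        Split skipped = split(40, 7, 8, 8, false);
        Split drained = split(40, 7, 8, 8, true);
        System.out.println("vectorized without drain: " + skipped.vectorized());
        System.out.println("vectorized with drain:    " + drained.vectorized());
    }
}
```

In this model, a trip count of 40 with 7 alignment iterations, vector-length 8 and super-unroll factor 8 leaves 33 iterations after the pre-loop. That is fewer than the 64 needed for the main-loop, so all 33 run scalar unless the vectorized post-loop is allowed to run, in which case 32 of them are vectorized.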
One idea was JDK-8308994. That may work for some platforms, though it may not be worth the complexity. Still, that PR gathered some interesting data in its plots and benchmarks, see:
https://github.com/openjdk/jdk/pull/14581
Look at the saw-tooth plots it produces: they show that there are different phases.
We should have a similarly nice benchmark, so that we can experiment properly.
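As a rough idea of what such a measurement could look like, here is a plain-Java timing sketch that reports the time per element for each iteration count. It is only an assumption of how one might start: the class `SmallTripCountTiming` and its methods are hypothetical, the kernel is an arbitrary element-wise add, and a real benchmark would use a proper harness such as JMH rather than `System.nanoTime()` loops.

```java
import java.util.Arrays;

// Plain-Java timing sketch (hypothetical names). A real measurement would use
// a benchmark harness such as JMH to handle warm-up, inlining and dead code.
final class SmallTripCountTiming {

    /** Kernel under test: a simple element-wise add over the first n elements. */
    static void add(int[] a, int[] b, int[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Average nanoseconds per element for a loop of length n (n >= 1). */
    static double nanosPerElement(int n, int reps) {
        int[] a = new int[n];
        int[] b = new int[n];
        int[] out = new int[n];
        Arrays.fill(a, 1);
        Arrays.fill(b, 2);

        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            add(a, b, out, n);
        }
        long elapsed = System.nanoTime() - start;

        // Use the result so the kernel cannot be discarded entirely.
        if (out[n - 1] != 3) {
            throw new AssertionError("unexpected kernel result");
        }
        return (double) elapsed / ((double) reps * n);
    }

    public static void main(String[] args) {
        int maxN = 128;
        // Warm-up pass so the kernel is likely compiled before measuring.
        for (int n = 1; n <= maxN; n++) {
            nanosPerElement(n, 10_000);
        }
        // One line per iteration count: "<n> <ns per element>".
        for (int n = 1; n <= maxN; n++) {
            System.out.printf("%d %.3f%n", n, nanosPerElement(n, 50_000));
        }
    }
}
```

Plotting the second column against the first should make any phases visible, though the exact shape depends on the platform and JVM flags.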
Another issue is the pre-loop: it aligns the vectors for the main-loop. But it seems that on modern CPUs such alignment matters less for performance than on older CPUs. So here we waste iterations that we might need to enter the vectorized loops.
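The cost of alignment can be sketched with simple address arithmetic. The following is only an illustration of the idea, assuming a single memory access whose address is a multiple of the element size; the class `PreLoopAlignment` and the method `itersToAlign` are hypothetical, and C2's actual pre-loop limit computation is more involved (in this model the result is at most vector-length - 1).

```java
// Illustration of the alignment arithmetic only; hypothetical names, and not
// how C2 actually computes the pre-loop limit.
final class PreLoopAlignment {

    /**
     * Number of scalar iterations needed before the access at 'address'
     * becomes aligned to 'vectorBytes'.
     * Assumes 'address' is a multiple of 'elemBytes' and that 'vectorBytes'
     * is a multiple of 'elemBytes'.
     */
    static int itersToAlign(long address, int elemBytes, int vectorBytes) {
        long misalignment = address % vectorBytes;
        long padBytes = (vectorBytes - misalignment) % vectorBytes;
        return (int) (padBytes / elemBytes);
    }

    public static void main(String[] args) {
        // int elements (4 bytes), 32-byte vectors, made-up addresses.
        System.out.println(itersToAlign(4096, 4, 32));
        System.out.println(itersToAlign(4120, 4, 32));
        System.out.println(itersToAlign(4100, 4, 32));
    }
}
```

For a loop with only a few dozen iterations, spending up to 7 of them on alignment in this example can be the difference between reaching vectorized code and not reaching it.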
relates to:
- JDK-8344118 C2 SuperWord: add VectorThroughputForIterationCount benchmark (Resolved)
- JDK-8299808 C2 SuperWord: investigate performance difference to ArrayFill (Open)
- JDK-8342692 C2: long counted loop/long range checks: don't create loop-nest for short running loops (Open)
- JDK-8308994 C2: Re-implement experimental post loop vectorization (In Progress)
- JDK-8307084 C2: Vectorized drain loop is not executed for some small trip counts (In Progress)