-
Enhancement
-
Resolution: Unresolved
-
P4
-
24
[~qamai] had this idea, I'm filing it for him.
When we vectorize reductions, we try to move them out of the loop, see PhaseIdealLoop::move_unordered_reduction_out_of_loop introduced in JDK-8302652 / https://github.com/openjdk/jdk/pull/13056.
That still leaves us with a chain of vector-adds, whose latency can limit performance. I'm copying this from elsewhere:
[~qamai]:
Reassociation idea: A reduction loop is latency-bound, so we can reassociate the operations of an unrolled loop to saturate the ALU and load/store units, e.g. by transforming x4 + (x3 + (x2 + (x1 + x))) into x + (x4 + (x3 + (x2 + x1))). This should be easier and introduce less register pressure than having several dedicated reduction lanes.
[~epeter]:
Ok, yes. After moving the reduction out of the loop, we now have add-vectors in a sequence.
This has high latency. We could further improve things this way:
- give each its own phi -> smaller latency but requires more registers
- reassociate them -> if we do it right, i.e. xv = xv + (xv4 + (xv3 + (xv2 + xv1))), then the latency is still minimal, but the register pressure on the backedge is smaller. Nice idea!
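To make the three shapes concrete, here is a minimal scalar sketch in Java. It is only an illustration: plain `int` values stand in for the vector values xv1..xv4, the class and method names are hypothetical, and nothing here reflects how C2 would actually implement the transformation. Because `int` addition is associative and commutative (including on overflow), all three variants return the same sum; only the length of the loop-carried dependency chain differs.

```java
// Hypothetical illustration; scalars stand in for the vector-adds discussed above.
class ReductionReassociation {
    // Original shape: x4 + (x3 + (x2 + (x1 + x))).
    // All four adds per iteration depend on the loop-carried value x.
    static int chained(int[] a) {
        int x = 0;
        int limit = a.length & ~3; // largest multiple of 4 <= length
        for (int i = 0; i < limit; i += 4) {
            x = a[i + 3] + (a[i + 2] + (a[i + 1] + (a[i] + x)));
        }
        for (int i = limit; i < a.length; i++) { x += a[i]; } // tail
        return x;
    }

    // Option 1: each add gets its own accumulator (its own phi).
    // One add on each carried chain, but four values live across the backedge.
    static int separateAccumulators(int[] a) {
        int x1 = 0, x2 = 0, x3 = 0, x4 = 0;
        int limit = a.length & ~3;
        for (int i = 0; i < limit; i += 4) {
            x1 += a[i];
            x2 += a[i + 1];
            x3 += a[i + 2];
            x4 += a[i + 3];
        }
        int x = (x1 + x2) + (x3 + x4);
        for (int i = limit; i < a.length; i++) { x += a[i]; } // tail
        return x;
    }

    // Option 2: reassociated shape x + (x4 + (x3 + (x2 + x1))).
    // The inner sum does not depend on x, so only one add sits on the
    // carried chain, and only x is live across the backedge.
    static int reassociated(int[] a) {
        int x = 0;
        int limit = a.length & ~3;
        for (int i = 0; i < limit; i += 4) {
            int partial = a[i + 3] + (a[i + 2] + (a[i + 1] + a[i]));
            x = x + partial;
        }
        for (int i = limit; i < a.length; i++) { x += a[i]; } // tail
        return x;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(chained(a) + " " + separateAccumulators(a) + " " + reassociated(a));
    }
}
```

Note that this equivalence relies on the reduction being reorderable, as with integer addition; a floating-point sum would generally round differently after such a rewrite.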
I may soon refactor away PhaseIdealLoop::move_unordered_reduction_out_of_loop, and move it into VLoop::optimize, so we can already predict during auto-vectorization whether we can move the reduction nodes out of the loop, which makes vectorization more profitable.
So this optimization would have to be stand-alone. Maybe it could be done in IGVN after loop-opts, when we are done super-unrolling.
It would require that we find a benchmark where the reduction latency is the bottleneck, and not any other computation or memory operation.
- relates to
  - JDK-8345044 Sum of array elements not vectorized (Open)
  - JDK-8302652 [SuperWord] Reduction should happen after loop, when possible (Resolved)