Pseudocode:
acc = init
for (i ...) {
vec = "some vector ops"; // vec holds vector of results from this iteration
vector_reduction(vec, acc); // reduces vector vec into scalar accumulator acc
}
// use acc
However, for integer reductions, and for some floating-point reductions that do not require a linear order (Min / Max), we can do better. We can use a vector accumulator in the loop, and perform the reduction on this vector only once, after the loop. This should significantly reduce the work per loop iteration, since each iteration then needs only one element-wise vector operation instead of a full reduction to a scalar.
v_acc = scalar_to_vector(init); // depends on reduction op how we would do this
for (i ...) {
vec = "some vector ops"; // vec holds vector of results from this iteration
v_acc = vector_element_wise_reduction(v_acc, vec); // lane-by-lane, no cross-lane work
}
acc = vector_reduction(v_acc);
// use acc
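To make the difference between the two pseudocode loops concrete, here is a minimal scalar sketch in Java that emulates vector lanes with plain arrays. It is only an illustration under stated assumptions: the class name ReductionSketch, the lane count LANES = 8, and both method names are hypothetical and are not part of C2. The reduction op is integer addition, and the input length is assumed to be a multiple of LANES.

```java
public class ReductionSketch {
    // Hypothetical vector width, chosen only for illustration.
    static final int LANES = 8;

    // Emulates the first loop: every iteration reduces its "vector"
    // straight into the scalar accumulator (cross-lane work in the loop).
    static int reduceEveryIteration(int[] data, int init) {
        int acc = init;
        for (int i = 0; i + LANES <= data.length; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                acc += data[i + lane];
            }
        }
        return acc;
    }

    // Emulates the second loop: the loop only does element-wise adds into
    // a vector accumulator; the cross-lane reduction happens once, after it.
    static int reduceAfterLoop(int[] data, int init) {
        // scalar_to_vector(init) for addition: init in one lane,
        // the identity element 0 in all other lanes.
        int[] vAcc = new int[LANES];
        vAcc[0] = init;
        for (int i = 0; i + LANES <= data.length; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                vAcc[lane] += data[i + lane]; // element-wise, no cross-lane work
            }
        }
        int acc = 0;
        for (int lane = 0; lane < LANES; lane++) {
            acc += vAcc[lane]; // single reduction after the loop
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = new int[64];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(reduceEveryIteration(data, 10));
        System.out.println(reduceAfterLoop(data, 10));
    }
}
```

Both methods return the same value for int addition because it is associative and commutative, even under overflow. The same transformation would not be safe for float addition, where the order of operations can change the result.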
Note: we already have different reduction implementations.
We already do a "recursive folding" for ints (C2_MacroAssembler::reduce8I), and a "linear folding" for floats (C2_MacroAssembler::reduce8F).
https://github.com/openjdk/jdk/blob/db1b48ef3bb4f8f0fbb6879200c0655b7fe006eb/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L1895-L1941
https://github.com/openjdk/jdk/blob/db1b48ef3bb4f8f0fbb6879200c0655b7fe006eb/src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp#L2096-L2120
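The two folding strategies named in the note above can be illustrated with a small Java sketch. This is not the code in C2_MacroAssembler; the class FoldingSketch and its methods are hypothetical and only model the order of operations on an 8-lane vector held in an array. It also shows why float addition needs the linear order: reordering can change the rounded result.

```java
public class FoldingSketch {
    // "Recursive folding": add the upper half onto the lower half until one
    // lane is left. Takes log2(n) steps; assumes the length is a power of two.
    static int recursiveFold(int[] v) {
        int[] t = v.clone();
        for (int half = t.length / 2; half >= 1; half /= 2) {
            for (int lane = 0; lane < half; lane++) {
                t[lane] += t[lane + half];
            }
        }
        return t[0];
    }

    // "Linear folding": strict left-to-right order, starting from acc.
    static float linearFold(float acc, float[] v) {
        for (float x : v) {
            acc += x;
        }
        return acc;
    }

    // The recursive order applied to floats, only to show that it can
    // produce a different result than the linear order.
    static float recursiveFoldFloat(float[] v) {
        float[] t = v.clone();
        for (int half = t.length / 2; half >= 1; half /= 2) {
            for (int lane = 0; lane < half; lane++) {
                t[lane] += t[lane + half];
            }
        }
        return t[0];
    }

    public static void main(String[] args) {
        int[] ints = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(recursiveFold(ints));

        // 1e8f + 1f rounds back to 1e8f, so the order of additions matters.
        float[] floats = {1e8f, 1f, -1e8f, 1f, 0f, 0f, 0f, 0f};
        System.out.println(linearFold(0f, floats));
        System.out.println(recursiveFoldFloat(floats));
    }
}
```

For the float input above, the linear order loses the first 1f to rounding while the recursive order does not, so the two results differ. This is the reason the vector-accumulator approach applies to integer reductions and to Min / Max, but not to float add or mul reductions that must keep the linear order.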
I found this while working on JDK-8302139, where I implemented an IR test for SuperWord reductions and inspected the generated code.
