-
Enhancement
-
Resolution: Fixed
-
P4
-
19
-
b21
(1) identify and mark reduction nodes and their corresponding loops early (to inform the unrolling policy),
(2) reuse later this information (which nodes are marked as reductions) to guide SLP vectorization.
By performing (1) before loop unrolling, this approach can identify reductions in a simple and efficient manner. However, it has a serious drawback that makes it hard to maintain: many different high-level loop transformations and low-level Ideal transformations can invalidate the node and loop reduction marks (flags) within the loop optimization iterations between steps (1) and (2) above, leading in the worst case to SLP miscompilations based on inconsistent reduction information.
A potential solution is to run step (1) on-demand: first to inform the unrolling policy, and then again to guide SLP vectorization. This would avoid this source of bugs by construction, because there would not be analysis information "on the side" that could be potentially invalidated. If a node found to be a reduction before unrolling stops being a reduction due to some loop or Ideal transformation, it will simply not be identified and treated as a reduction by SLP vectorization.
The solution suggested in this RFE requires generalizing reduction analysis to also identify reductions on unrolled loops. For example, the analysis should be able to detect that node (166 AddI) in reduction-before-unrolling.pdf (attached) is a reduction, and also that nodes {(166 AddI, 504 AddI, 579 AddI, 591 AddI)} form a reduction chain in the corresponding unrolled loop reduction-unrolled-x4.pdf. While detecting reduction chains in unrolled loops is necessarily more costly, the cost is never worse than ("number of phi nodes in the loop" x LoopMaxUnroll), under the assumption that all nodes in an unrolled reduction chain are connected via the same input number. This assumption is key to avoid a combinatorial explosion, and seems to hold in practice, since reduction chains are formed by cloning the same reduction node a number of times.
A prototype of this alternative design is available at https://github.com/robcasloz/jdk/tree/JDK-8287087.
Work left to do:
- Evaluate the overhead of doing general reduction analysis on unrolled loops and its contribution to total C2 execution time (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/superword.cpp#L115).
- Evaluate the overhead of doing per-node general reduction analysis for x86-64 min/max floating-point intrinsic selection (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/node.cpp#L3236).
- Test that the same reduction loops are vectorized as before. As part of this effort, IR test framework checks could be added to the reduction vectorization test cases for more robust regression testing. Some examples can be found in the prototype, e.g. https://github.com/robcasloz/jdk/blob/general-reduction-analysis/test/hotspot/jtreg/compiler/loopopts/superword/RedTest_int_x64.java. This would require additional support IR test framework for multi-target IR matching, see
- Test that the x86-64 min/max floating-point intrinsic selection works at least as well as (and possibly better than) the original implementation. The potential improvement over the original implementation comes from the fact that on-demand reduction analysis would be performed on *every* min/max floating-point node visited during instruction selection, not just on those visited in the context of SLP analysis.
- is blocked by
-
JDK-8294715 Add IR checks to the reduction vectorization tests
- Resolved
- relates to
-
JDK-8279622 C2: miscompilation of map pattern as a vector reduction
- Resolved
-
JDK-8286177 C2: "failed: non-reduction loop contains reduction nodes" assert failure
- Resolved
-
JDK-8308746 C2 IR test failures for TestFpMinMaxReductions.java with SSE2
- Resolved
-
JDK-8261147 C2: Node is wrongly marked as reduction resulting in a wrong execution due to wrong vector instructions
- Closed
-
JDK-8305707 SuperWord should vectorize reverse-order reduction loops
- Open
-
JDK-8306989 C2: generalize search in SLP reduction analysis
- Open
-
JDK-8074981 Integer/FP scalar reduction optimization
- Resolved
-
JDK-8217561 x86: Add floating-point Math.min/max intrinsics
- Resolved
-
JDK-8290964 C2 compilation fails with assert "non-reduction loop contains reduction nodes"
- Resolved
-
JDK-8317507 C2 compilation fails with "Exceeded _node_regs array"
- Closed