Details

Enhancement

Resolution: Fixed

P4

19

b21
Description
(1) identify and mark reduction nodes and their corresponding loops early (to inform the unrolling policy),
(2) reuse later this information (which nodes are marked as reductions) to guide SLP vectorization.
By performing (1) before loop unrolling, this approach can identify reductions in a simple and efficient manner. However, it has a serious drawback that makes it hard to maintain: many different highlevel loop transformations and lowlevel Ideal transformations can invalidate the node and loop reduction marks (flags) within the loop optimization iterations between steps (1) and (2) above, leading in the worst case to SLP miscompilations based on inconsistent reduction information.
A potential solution is to run step (1) ondemand: first to inform the unrolling policy, and then again to guide SLP vectorization. This would avoid this source of bugs by construction, because there would not be analysis information "on the side" that could be potentially invalidated. If a node found to be a reduction before unrolling stops being a reduction due to some loop or Ideal transformation, it will simply not be identified and treated as a reduction by SLP vectorization.
The solution suggested in this RFE requires generalizing reduction analysis to also identify reductions on unrolled loops. For example, the analysis should be able to detect that node (166 AddI) in reductionbeforeunrolling.pdf (attached) is a reduction, and also that nodes {(166 AddI, 504 AddI, 579 AddI, 591 AddI)} form a reduction chain in the corresponding unrolled loop reductionunrolledx4.pdf. While detecting reduction chains in unrolled loops is necessarily more costly, the cost is never worse than ("number of phi nodes in the loop" x LoopMaxUnroll), under the assumption that all nodes in an unrolled reduction chain are connected via the same input number. This assumption is key to avoid a combinatorial explosion, and seems to hold in practice, since reduction chains are formed by cloning the same reduction node a number of times.
A prototype of this alternative design is available at https://github.com/robcasloz/jdk/tree/JDK8287087.
Work left to do:
 Evaluate the overhead of doing general reduction analysis on unrolled loops and its contribution to total C2 execution time (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/superword.cpp#L115).
 Evaluate the overhead of doing pernode general reduction analysis for x8664 min/max floatingpoint intrinsic selection (https://github.com/robcasloz/jdk/blob/2171ced27b6a99651b194ab51919c68ade5afb23/src/hotspot/share/opto/node.cpp#L3236).
 Test that the same reduction loops are vectorized as before. As part of this effort, IR test framework checks could be added to the reduction vectorization test cases for more robust regression testing. Some examples can be found in the prototype, e.g. https://github.com/robcasloz/jdk/blob/generalreductionanalysis/test/hotspot/jtreg/compiler/loopopts/superword/RedTest_int_x64.java. This would require additional support IR test framework for multitarget IR matching, see
 Test that the x8664 min/max floatingpoint intrinsic selection works at least as well as (and possibly better than) the original implementation. The potential improvement over the original implementation comes from the fact that ondemand reduction analysis would be performed on *every* min/max floatingpoint node visited during instruction selection, not just on those visited in the context of SLP analysis.
Attachments
Issue Links
 is blocked by

JDK8294715 Add IR checks to the reduction vectorization tests
 Resolved
 relates to

JDK8279622 C2: miscompilation of map pattern as a vector reduction
 Resolved

JDK8286177 C2: "failed: nonreduction loop contains reduction nodes" assert failure
 Resolved

JDK8308746 C2 IR test failures for TestFpMinMaxReductions.java with SSE2
 Resolved

JDK8261147 C2: Node is wrongly marked as reduction resulting in a wrong execution due to wrong vector instructions
 Closed

JDK8305707 SuperWord should vectorize reverseorder reduction loops
 Open

JDK8306989 C2: generalize search in SLP reduction analysis
 Open

JDK8074981 Integer/FP scalar reduction optimization
 Resolved

JDK8217561 x86: Add floatingpoint Math.min/max intrinsics
 Resolved

JDK8290964 C2 compilation fails with assert "nonreduction loop contains reduction nodes"
 Resolved

JDK8317507 C2 compilation fails with "Exceeded _node_regs array"
 Closed