Enhancement
Resolution: Unresolved
P4
26
The flag LoopMaxUnroll currently behaves somewhat unexpectedly, and does not allow explicit control over unrolling before vectorization and after vectorization.
Such control would be quite helpful for debugging performance issues, such as those encountered in JDK-8367158, where we want to compare different unrolling factors of vectorized code. For example, we may want to compare auto-vectorized code at different super-unrolling factors against the fill/copy intrinsics.
--------------------- How LoopMaxUnroll currently works ----------------
The description of the flag is a bit weak:
    product(intx, LoopMaxUnroll, 16, \
            "Maximum number of unrolls for main loop") \
            range(0, max_jint) \
It seems to suggest it might cover the factor of "pre-vec-unroll * super-unroll", i.e. total unroll. But that does not seem to be the case.
In IdealLoopTree::policy_unroll
    _local_loop_unroll_limit = LoopUnrollLimit;
    _local_loop_unroll_factor = 4;
    int future_unroll_cnt = cl->unrolled_count() * 2;
    if (!cl->is_vectorized_loop()) {
      if (future_unroll_cnt > LoopMaxUnroll) return false;
    } else {
      // obey user constraints on vector mapped loops with additional unrolling applied
      int unroll_constraint = (cl->slp_max_unroll()) ? cl->slp_max_unroll() : 1;
      if ((future_unroll_cnt / unroll_constraint) > LoopMaxUnroll) return false;
    }
So when we are checking if we can unroll, there are these cases:
- scalar loop -> do not unroll more than LoopMaxUnroll
- vector main loop -> do not unroll more than LoopMaxUnroll*slp_max_unroll
- vector drain loop -> do not unroll more than LoopMaxUnroll. Q: can that lead to super-unrolling? Probably not...?
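The three cases above can be checked with a small stand-alone model. This is only a sketch of the limit check quoted from policy_unroll, not HotSpot code: the function names and parameters are hypothetical, the loop is assumed to start at an unroll count of 1 and double each round, and the "drain loop" case is assumed to be a vectorized loop whose slp_max_unroll is 0.

```cpp
#include <cassert>

// Hypothetical model of the LoopMaxUnroll check quoted from
// IdealLoopTree::policy_unroll. Returns true if the check would
// permit one more doubling of the loop body.
bool unroll_limit_allows(int unrolled_count, bool is_vectorized,
                         int slp_max_unroll, int loop_max_unroll) {
  int future_unroll_cnt = unrolled_count * 2;
  if (!is_vectorized) {
    return future_unroll_cnt <= loop_max_unroll;
  }
  // Same fallback as the excerpt: a zero slp_max_unroll acts as 1.
  int unroll_constraint = (slp_max_unroll != 0) ? slp_max_unroll : 1;
  return (future_unroll_cnt / unroll_constraint) <= loop_max_unroll;
}

// Largest unroll count reachable by repeated doubling from 1 (assumption:
// no other heuristic stops unrolling first). The 1 << 20 cap only guards
// this model against integer overflow for very large limits.
int max_reachable_unroll(bool is_vectorized, int slp_max_unroll,
                         int loop_max_unroll) {
  int count = 1;
  while (count < (1 << 20) &&
         unroll_limit_allows(count, is_vectorized, slp_max_unroll, loop_max_unroll)) {
    count *= 2;
  }
  return count;
}
```

Under these assumptions, max_reachable_unroll(false, 0, 16) yields 16 for a scalar loop, while max_reachable_unroll(true, 8, 16) yields 128, i.e. LoopMaxUnroll * slp_max_unroll for a vectorized main loop.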
It seems we also always set _local_loop_unroll_factor = 4, which may be relevant below.
    if (phase->C->do_superword()) {
      // Only attempt slp analysis when user controls do not prohibit it
      if (!range_checks_present() && (LoopMaxUnroll > _local_loop_unroll_factor)) {
        // Once policy_slp_analysis succeeds, mark the loop with the
        // maximal unroll factor so that we minimize analysis passes
        if (future_unroll_cnt >= _local_loop_unroll_factor) {
          policy_unroll_slp_analysis(cl, phase, future_unroll_cnt);
        }
      }
    }
So LoopMaxUnroll must be greater than 4 (in practice 8 or larger, if we only consider powers of two), otherwise we don't do the policy_unroll_slp_analysis. Curious!
-> Investigate!
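As a starting point for that investigation, here is a minimal model of the guard in front of policy_unroll_slp_analysis. It only mirrors the conditions in the excerpt; the function name, the parameters and the constant name are hypothetical, and the value 4 is taken from the assignment to _local_loop_unroll_factor shown earlier. Read literally, the comparison against LoopMaxUnroll is a strict greater-than.

```cpp
#include <cassert>

// Value assigned to _local_loop_unroll_factor in the policy_unroll excerpt.
const int kLocalLoopUnrollFactor = 4;

// Hypothetical model: would policy_unroll_slp_analysis be called?
bool slp_analysis_attempted(bool do_superword, bool range_checks_present,
                            int loop_max_unroll, int future_unroll_cnt) {
  if (!do_superword) {
    return false;
  }
  // "Only attempt slp analysis when user controls do not prohibit it"
  if (range_checks_present || loop_max_unroll <= kLocalLoopUnrollFactor) {
    return false;
  }
  // The analysis only starts once the next unroll count reaches the factor.
  return future_unroll_cnt >= kLocalLoopUnrollFactor;
}
```

In this model, a LoopMaxUnroll of 4 never reaches the analysis, while any larger value does once future_unroll_cnt is at least 4. So lowering LoopMaxUnroll to limit unrolling can, as a side effect, also switch off the SLP unrolling analysis.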
    int slp_max_unroll_factor = cl->slp_max_unroll();
    if ((LoopMaxUnroll < slp_max_unroll_factor) && FLAG_IS_DEFAULT(LoopMaxUnroll) && UseSubwordForMaxVector) {
      LoopMaxUnroll = slp_max_unroll_factor;
    }
We may now update the flag, if it is still at its default. But this is a global update... so the next time we come through here, we cannot update it any more, right? Does this look good at all?
-> Investigation: OK, it does get updated repeatedly. I played around with an example like TestLoopMaxUnrollIncreasing.java.
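The observed behavior can be illustrated with a small model. It assumes that the plain assignment to LoopMaxUnroll leaves the result of FLAG_IS_DEFAULT unchanged, which would be consistent with the repeated updates seen in the experiment but is not verified here. FlagModel and maybe_raise are hypothetical names used only for this sketch.

```cpp
#include <cassert>

// Hypothetical stand-in for the global flag and its "default" marker.
struct FlagModel {
  int  loop_max_unroll = 16;  // default value from the flag definition
  bool is_default = true;     // stand-in for FLAG_IS_DEFAULT(LoopMaxUnroll)
};

// Models the update quoted above, applied once per loop that reaches it.
void maybe_raise(FlagModel& flag, int slp_max_unroll_factor,
                 bool use_subword_for_max_vector) {
  if (flag.loop_max_unroll < slp_max_unroll_factor &&
      flag.is_default && use_subword_for_max_vector) {
    // Assumption: a plain assignment does not clear the default marker,
    // so a later loop with a larger factor can raise the value again.
    flag.loop_max_unroll = slp_max_unroll_factor;
  }
}
```

In this model the value only ever grows: after one loop raises it to 32 and another to 64, a later loop with a factor of 8 sees a limit of 64. If the real flag behaves the same way, the effective limit for a loop depends on which loops were compiled before it.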
We also use the flag to limit the search for reduction chains in SuperWord:
src/hotspot/share/opto/superword.cpp: PathEnd path_to_phi = find_in_path(n, input, LoopMaxUnroll, has_my_opcode,
src/hotspot/share/opto/superword.cpp: PathEnd path_from_phi = find_in_path(first, input, LoopMaxUnroll, has_my_opcode,
policy_unroll_slp_analysis
- unrolling_analysis sets _local_loop_unroll_factor, mark_passed_slp, and set_slp_max_unroll.
- Maybe we can refactor the code, so we don't set random states all over the place?
- Eventually, we set _local_loop_unroll_limit, which seems to be the real limit... but that appears to be a node limit? That is a bit strange, and could lead to inaccurate super-unrolling. The heuristic is also odd.
-> This probably leads to the horrible over-unrolling we have seen when we pass slp analysis but fail to vectorize!
TODO: policy_unroll_slp_analysis - meaning of related fields
TODO: consider deprecating UseSubwordForMaxVector - I see no reason to disable it ever. We also have no tests for its correctness if disabled.
TODO: consider deprecating SuperWordLoopUnrollAnalysis - ah but it is false on arm only... strange!
relates to
- JDK-8367158 C2: create better fill and copy benchmarks, taking alignment into account (Open)
- JDK-8129920 Vectorized loop unrolling (Resolved)
- JDK-8187601 Unrolling more when SLP auto-vectorization failed (Resolved)