Type: Enhancement
Resolution: Unresolved
Priority: P4
Affects Version/s: 26
Component/s: hotspot
Labels: aarch64
Found during work on JDK-8340093
The test has a few cases where we do not vectorize, because of long mul reductions / element-wise long-multiply vectors:
test/hotspot/jtreg/compiler/loopopts/superword/TestReductions.java
TestReductions.longMulSimple, however, does vectorize, but the vectorized code is slower than the non-vectorized code.
We already saw this in https://github.com/openjdk/jdk/pull/25387,
where we reached only 0.38x of the scalar performance.
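For context, the problematic pattern is a long multiply reduction. A minimal sketch of the loop shape, assuming a form like this (hypothetical; the actual TestReductions.longMulSimple may differ in details):

    public class LongMulReductionSketch {
        // A long mul reduction: every iteration multiplies the accumulator
        // with an array element. SuperWord can vectorize this into MulVL
        // (element-wise multiply) plus MulReductionVL (the reduction itself).
        static long longMulSimple(long[] data) {
            long acc = 1;
            for (int i = 0; i < data.length; i++) {
                acc *= data[i];
            }
            return acc;
        }
    }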
The issue seems to be this:
- Matcher::match_rule_supported_vector
  - has a comment that says that 64/128-bit vector reductions for MulReductionVL are supported
- Matcher::match_rule_supported_auto_vectorization
  - excludes MulVL from auto-vectorization, because apparently no NEON implementation is available.
- However: in the backend we implement both MulVL and MulReductionVL, but with a scalar implementation that packs and unpacks the lanes through general-purpose registers.
- That is very inefficient and can lead to slowdowns. I wonder if that also has an impact on the Vector API; probably yes (see the sketch after this list).
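A minimal sketch of Vector API code that presumably maps to MulVL and MulReductionVL on aarch64 (hypothetical example, not from the JDK sources; assumes array lengths are a multiple of the species length, and requires --add-modules jdk.incubator.vector):

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class LongMulVectorSketch {
        // 128-bit species: two long lanes, i.e. one NEON register.
        static final VectorSpecies<Long> L128 = LongVector.SPECIES_128;

        // Element-wise long multiply; expected to use MulVL in the backend.
        static void mul(long[] a, long[] b, long[] r) {
            for (int i = 0; i < a.length; i += L128.length()) {
                LongVector va = LongVector.fromArray(L128, a, i);
                LongVector vb = LongVector.fromArray(L128, b, i);
                va.mul(vb).intoArray(r, i);
            }
        }

        // Multiply reduction; expected to use MulReductionVL in the backend.
        static long mulReduce(long[] a) {
            long acc = 1;
            for (int i = 0; i < a.length; i += L128.length()) {
                acc *= LongVector.fromArray(L128, a, i)
                                 .reduceLanes(VectorOperators.MUL);
            }
            return acc;
        }
    }

If the backend expands these nodes by moving lanes through general-purpose registers, such code would pay the scalarization cost on every multiply, which is why a Vector API impact seems likely.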
We have multiple options here:
- We could just prevent long mul reductions for NEON completely.
- But: in some odd cases vectorization may still be profitable. For those, we could instead adjust the cost model: make MulVL and MulReductionVL more expensive. This is probably the preferable approach.
Update: JDK-8340093 now just prevents a long mul reduction from inserting a MulVL in VTransformReductionVectorNode::optimize_move_non_strict_order_reductions_out_of_loop.
That makes sure we don't have a regression for now. There could now be edge cases that would prefer the MulVL, but probably almost none.
relates to: JDK-8340093 C2 SuperWord: implement cost model (Resolved)