Type: Enhancement
Resolution: Unresolved
Priority: P4
CPU: aarch64
Found during work on JDK-8340093.
The test test/hotspot/jtreg/compiler/loopopts/superword/TestReductions.java has a few cases where we do not vectorize because of long mul reductions / element-wise long vectors.
But TestReductions.longMulSimple does vectorize, and that leads to a performance regression compared to the non-vectorized code.
We already saw this in https://github.com/openjdk/jdk/pull/25387, where the vectorized version reached only 0.38x of the scalar performance (a benchmark sketch follows below).
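For reference, a minimal JMH sketch of how one could reproduce the comparison, assuming -XX:-UseSuperWord as the scalar baseline. This is not the benchmark from the PR; the class name, array size, and loop shape are illustrative only.

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Thread)
    @Fork(value = 1)
    public class LongMulReductionBench {
        @Param("10000")
        int size;

        long[] data;

        @Setup
        public void setup() {
            data = new long[size];
            for (int i = 0; i < size; i++) {
                data[i] = i + 1;
            }
        }

        // Default fork: SuperWord may vectorize this long mul reduction.
        @Benchmark
        public long longMulSimple() {
            long product = 1;
            for (int i = 0; i < data.length; i++) {
                product *= data[i];
            }
            return product;
        }

        // Scalar baseline: same loop with SuperWord auto-vectorization off.
        @Benchmark
        @Fork(value = 1, jvmArgsAppend = "-XX:-UseSuperWord")
        public long longMulSimpleScalar() {
            long product = 1;
            for (int i = 0; i < data.length; i++) {
                product *= data[i];
            }
            return product;
        }
    }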
The issue seems to be this:
- Matcher::match_rule_supported_vector has a comment saying that 64/128-bit vector reductions for MulReductionVL are supported.
- Matcher::match_rule_supported_auto_vectorization excludes MulVL from auto-vectorization, because apparently no NEON implementation is available.
- However: in the backend we implement both MulVL and MulReductionVL, but with a scalar implementation (pack and unpack). That is very inefficient and can lead to slowdowns. I wonder if that also has an impact on the Vector API; probably yes (see the sketch below).
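To make the Vector API angle concrete, a small sketch of code that would go through the same matcher rules. That this maps to MulVL and MulReductionVL nodes is my assumption based on the rules above, not something I verified; compile with --add-modules jdk.incubator.vector.

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class LongMulNodes {
        // 128-bit species: two long lanes, the NEON case discussed here.
        static final VectorSpecies<Long> S = LongVector.SPECIES_128;

        static long demo(long[] a, long[] b) {
            LongVector va = LongVector.fromArray(S, a, 0);
            LongVector vb = LongVector.fromArray(S, b, 0);
            // Element-wise long multiply: presumably backed by a MulVL node,
            // i.e. scalar pack/unpack on NEON.
            LongVector prod = va.mul(vb);
            // Cross-lane multiply reduction: presumably a MulReductionVL node,
            // also scalar pack/unpack on NEON.
            return prod.reduceLanes(VectorOperators.MUL);
        }
    }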
We have multiple options here:
- We could just prevent long mul reductions for NEON completely.
- But: in some odd cases vectorization may still be profitable. For that, we could instead adjust the cost model and make MulVL and MulReductionVL more expensive. This is probably the preferable method.
Update: JDK-8340093 now simply prevents a long mul reduction from inserting a MulVL in VTransformReductionVectorNode::optimize_move_non_strict_order_reductions_out_of_loop (the two loop shapes involved are sketched below).
That makes sure we do not have a regression for now. But there could now be edge cases that would prefer the MulVL... though probably almost none.
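For intuition, a hedged sketch of the two reduction shapes that transform converts between, using the Vector API as pseudocode for the compiler's internal vector nodes. The names are illustrative, and the array length is assumed to be a multiple of the lane count.

    import jdk.incubator.vector.LongVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    public class ReductionShapes {
        static final VectorSpecies<Long> S = LongVector.SPECIES_128;

        // Strict-order shape: a MulReductionVL in every iteration.
        static long strictOrder(long[] data) {
            long product = 1;
            for (int i = 0; i < data.length; i += S.length()) {
                product *= LongVector.fromArray(S, data, i)
                                     .reduceLanes(VectorOperators.MUL);
            }
            return product;
        }

        // Shape after moving the (non-strict-order) reduction out of the
        // loop: one MulVL per iteration, and a single MulReductionVL after
        // the loop. JDK-8340093 now avoids producing this MulVL for long
        // mul reductions. Long multiplication is associative, so the
        // reordering is legal for a non-strict-order reduction.
        static long movedOutOfLoop(long[] data) {
            LongVector acc = LongVector.broadcast(S, 1L);
            for (int i = 0; i < data.length; i += S.length()) {
                acc = acc.mul(LongVector.fromArray(S, data, i));
            }
            return acc.reduceLanes(VectorOperators.MUL);
        }
    }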
Relates to: JDK-8340093 C2 SuperWord: implement cost model (Open)