- Enhancement
- Resolution: Unresolved
- P4
- 24, 25
- generic
- generic
For the following vector kernel, we currently create multiple AddVI IR nodes
that differ only in the order of their inputs. This results in the emission of an additional VPADDD instruction and increases the dynamic instruction count.
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class test_node_sharing {
    public static final VectorSpecies<Integer> ISP = IntVector.SPECIES_PREFERRED;

    public static void micro(int[] res, int[] src1, int[] src2) {
        for (int i = 0; i < ISP.loopBound(res.length); i += ISP.length()) {
            IntVector vec1 = IntVector.fromArray(ISP, src1, i);
            IntVector vec2 = IntVector.fromArray(ISP, src2, i);
            // Parallel dispatch over two execution ports will absorb any impact on
            // the latency of the worklet, but path length will improve.
            // vec3 and vec4 compute the same sum with the operands swapped.
            IntVector vec3 = vec1.lanewise(VectorOperators.ADD, vec2);
            IntVector vec4 = vec2.lanewise(VectorOperators.ADD, vec1);
            vec4.lanewise(VectorOperators.ADD, vec3)
                .intoArray(res, i);
        }
    }
}
CPROMPT>java -XX:LoopUnrollLimit=0 --add-modules=jdk.incubator.vector -XX:CompileCommand=PrintIdeal,test_node_sharing::micro -cp . test_node_sharing | grep AddVI
WARNING: Using incubator modules: jdk.incubator.vector
107 AddVI === _ 155 156 [[ 52 61 ]] #vectorz[16]:{int} !jvms: IntVector::lanewiseTemplate @ bci:154 (line 784) Int512Vector::lanewise @ bci:3 (line 285) Int512Vector::lanewise @ bci:3 (line 41) test_node_sharing::micro @ bci:67 (line 17)
155 AddVI === _ 199 115 [[ 107 ]] #vectorz[16]:{int} !jvms: IntVector::lanewiseTemplate @ bci:154 (line 784) Int512Vector::lanewise @ bci:3 (line 285) Int512Vector::lanewise @ bci:3 (line 41) test_node_sharing::micro @ bci:55 (line 16)
156 AddVI === _ 115 199 [[ 107 ]] #vectorz[16]:{int} !jvms: IntVector::lanewiseTemplate @ bci:154 (line 784) Int512Vector::lanewise @ bci:3 (line 285) Int512Vector::lanewise @ bci:3 (line 41) test_node_sharing::micro @ bci:43 (line 15)
91 AddVI === _ 127 128 [[ 57 ]] #vectorz[16]:{int} !jvms: IntVector::lanewiseTemplate @ bci:154 (line 784) Int512Vector::lanewise @ bci:3 (line 285) Int512Vector::lanewise @ bci:3 (line 41) test_node_sharing::micro @ bci:67 (line 17)
127 AddVI === _ 292 246 [[ 91 ]] #vectorz[16]:{int} !jvms: IntVector::lanewiseTemplate @ bci:154 (line 784) Int512Vector::lanewise @ bci:3 (line 285) Int512Vector::lanewise @ bci:3 (line 41) test_node_sharing::micro @ bci:55 (line 16)
128 AddVI === _ 246 292 [[ 91 ]] #vectorz[16]:{int} !jvms: IntVector::lanewiseTemplate @ bci:154 (line 784) Int512Vector::lanewise @ bci:3 (line 285) Int512Vector::lanewise @ bci:3 (line 41) test_node_sharing::micro @ bci:43 (line 15)
CPROMPT>
CPROMPT>
CPROMPT>java -XX:LoopUnrollLimit=0 --add-modules=jdk.incubator.vector -XX:CompileCommand=Print,test_node_sharing::micro -cp . test_node_sharing | grep vpaddd
WARNING: Using incubator modules: jdk.incubator.vector
154 vpaddd XMM1,XMM0,XMM2 ! add packedI
15a vpaddd XMM0,XMM2,XMM0 ! add packedI
160 vpaddd XMM0,XMM1,XMM0 ! add packedI
0x00007fd2881b10b4: vpaddd %zmm2,%zmm0,%zmm1
0x00007fd2881b10ba: vpaddd %zmm0,%zmm2,%zmm0 ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
0x00007fd2881b10c0: vpaddd %zmm0,%zmm1,%zmm0 ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
116 vpaddd XMM2,XMM1,XMM0 ! add packedI
11c vpaddd XMM0,XMM0,XMM1 ! add packedI
122 vpaddd XMM0,XMM2,XMM0 ! add packedI
0x00007fd2881b16d6: vpaddd %zmm0,%zmm1,%zmm2
0x00007fd2881b16dc: vpaddd %zmm1,%zmm0,%zmm0
0x00007fd2881b16e2: vpaddd %zmm0,%zmm2,%zmm0
CPROMPT>
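There is scope for sharing vector IR nodes for commutative operators: two nodes with the same opcode and the same inputs in swapped order compute the same value and could be merged into one.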
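The idea can be illustrated with a toy value-numbering table. This is only a minimal sketch under stated assumptions, not the C2 implementation: the class `CommutativeSharingSketch`, its string keys, and the `countAddNodes` helper are hypothetical names introduced for illustration. It models the kernel above as two leaf inputs and three adds, and shows that putting the inputs of a commutative operator into a canonical order before the table lookup lets the two swapped adds share one node.

```java
import java.util.HashMap;
import java.util.Map;

// Toy value-numbering table. Hypothetical illustration only, not HotSpot code.
public class CommutativeSharingSketch {
    // Maps "opcode:in1:in2" to the id of the node that already computes it.
    private final Map<String, Integer> table = new HashMap<>();
    private int nextId = 0;

    // Creates a leaf node (for example a vector load) with a fresh id.
    int leaf() {
        return nextId++;
    }

    // Returns an existing node with the same opcode and inputs, or creates one.
    // When canonicalize is true, the inputs are ordered by id first, which is
    // only valid for commutative operators.
    int binary(String op, int in1, int in2, boolean canonicalize) {
        if (canonicalize && in1 > in2) {
            int tmp = in1;
            in1 = in2;
            in2 = tmp;
        }
        String key = op + ":" + in1 + ":" + in2;
        Integer existing = table.get(key);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        table.put(key, id);
        return id;
    }

    // Builds the graph of the kernel above and returns the number of distinct add nodes.
    static int countAddNodes(boolean canonicalize) {
        CommutativeSharingSketch g = new CommutativeSharingSketch();
        int vec1 = g.leaf();
        int vec2 = g.leaf();
        int vec3 = g.binary("AddVI", vec1, vec2, canonicalize);
        int vec4 = g.binary("AddVI", vec2, vec1, canonicalize);
        g.binary("AddVI", vec4, vec3, canonicalize);
        return g.table.size();
    }

    public static void main(String[] args) {
        System.out.println("without canonical input order: " + countAddNodes(false) + " add nodes");
        System.out.println("with canonical input order: " + countAddNodes(true) + " add nodes");
    }
}
```

In this model the kernel needs three add nodes without canonical ordering and two with it, which corresponds to one fewer VPADDD per loop iteration.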
- relates to
  - JDK-8348134 Promote scalar IR node sharing using Node::Flag_is_commutative_vector_op (Open)
- links to
  - Review(master) openjdk/jdk/22863