Currently, the AArch64 port cannot generate vectorized MLA/MLS instructions.
Take hotspot/test/compiler/loopopts/superword/SumRed_Int.java as an example.
C2 currently produces the following code snippet for it:
0x0000007f6cec12cc: mul v19.4s, v16.4s, v17.4s
0x0000007f6cec12d0: mul v16.4s, v16.4s, v18.4s
0x0000007f6cec12d4: mul v17.4s, v18.4s, v17.4s
0x0000007f6cec12d8: add v16.4s, v19.4s, v16.4s
0x0000007f6cec12dc: add v16.4s, v16.4s, v17.4s
With MLA, the two separate adds can be folded into the multiplies, so it can be further optimized into:
0x0000007f9cdb86dc: mul v19.4s, v16.4s, v17.4s
0x0000007f9cdb86e0: mla v19.4s, v16.4s, v18.4s
0x0000007f9cdb86e4: mla v19.4s, v17.4s, v18.4s
I have a patch that adds support for vectorized MLA/MLS instructions. I will post it on the list soon.
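For reference, here is a minimal Java sketch of the arithmetic behind the two sequences above. The loop shape is an assumption inferred from the three multiplies and two adds in the disassembly, not a copy of the test source, and the names MlaSketch, mulAddForm, mlaForm and mla are hypothetical. The mla helper models a single 32-bit lane of the instruction (accumulator plus product); it only illustrates why the rewrite preserves the result and says nothing about how C2 matches the pattern.

```java
public class MlaSketch {
    // Models one 32-bit lane of MLA: acc + x * y, with Java's wrapping int arithmetic.
    static int mla(int acc, int x, int y) {
        return acc + x * y;
    }

    // Assumed loop shape: three multiplies combined by two separate adds,
    // mirroring the mul/mul/mul/add/add sequence.
    static int mulAddForm(int[] a, int[] b, int[] c) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            total += (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
        }
        return total;
    }

    // Same computation expressed as one multiply followed by two
    // multiply-accumulates, mirroring the mul/mla/mla sequence.
    static int mlaForm(int[] a, int[] b, int[] c) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            int acc = a[i] * b[i];        // mul
            acc = mla(acc, a[i], c[i]);   // mla
            acc = mla(acc, b[i], c[i]);   // mla
            total += acc;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {5, 6, 7, 8};
        int[] c = {9, 10, 11, 12};
        System.out.println(mulAddForm(a, b, c) == mlaForm(a, b, c));
    }
}
```

Because int addition and multiplication wrap identically in both forms, the two methods agree for all inputs, including ones that overflow.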