Currently, for all CPUs, when matching RotateLeftV and RotateRightV with rules in AD files, one has to implement both the immediate and the variable versions.
On aarch64, with match rules for vector rotation, the immediate vector rotation can be optimized with shift-and-insert instructions (i.e. SLI/SRI; ~23% improvement with an initial implementation — see the sketch after the listing below).
However, there would be a performance regression for the variable version, because SLI/SRI have no register form in the NEON instruction set, and there is no register form for vector right shift either.
The instructions generated for the match rules of vector rotate variable would be:
# This is the performance regression: these loop invariants cannot be hoisted out of the loop.
mov w9, 32
dup v13.4s, w9
sub v20.4s, v13.4s, v19.4s
----------------------------
sshl v17.4s, v16.4s, v19.4s
neg v18.16b, v20.16b # on aarch64, vector right shift is implemented as left shift by negative shift count
ushl v16.4s, v16.4s, v18.4s
orr v16.16b, v17.16b, v16.16b
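For contrast, a rough sketch of what the immediate case could look like with shift+insert (the register numbers and the rotate amount, here 5 on 4S lanes, are chosen only for illustration):
shl v17.4s, v16.4s, #5   # dst = src << 5, low 5 bits of each lane are zero
sri v17.4s, v16.4s, #27  # insert src >> 27 into those low bits, keeping the high bits: dst = rotate_left(src, 5)
Two instructions, no extra registers and no per-loop setup; whether SLI or SRI is used only depends on whether the left or the right shift is materialized first.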
The immediate vector rotation should be split from the RotateLeftV and RotateRightV nodes, so that it can be matched and optimized separately on CPUs like aarch64.