Enhancement
Resolution: Unresolved
P4
25
Multiplication by a constant, `x*C`, can be strength-reduced to cheaper IR operations such as add or shift. For example:
1. x*8 can be optimized as x<<3.
2. x*9 can be optimized as x+(x<<3), and x+(x<<3) can be lowered to a single ADD-SHIFT instruction on some architectures, such as AArch64 and x86_64.
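The two rewrites above can be checked with a small Java sketch. The class and method names are hypothetical and are not part of C2. Note the parentheses in `x + (x << 3)`: in Java, `+` binds tighter than `<<`, so the unparenthesized form would compute `(x + x) << 3`.

```java
final class MulStrengthReduction {
    // x * 8 rewritten as a single left shift.
    static int mulBy8(int x) {
        return x << 3;
    }

    // x * 9 rewritten as an add of x and x shifted left by 3.
    static int mulBy9(int x) {
        return x + (x << 3);
    }

    public static void main(String[] args) {
        // Includes extreme values: int arithmetic wraps, and the rewrites wrap identically.
        int[] samples = {0, 1, -1, 7, -13, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (mulBy8(x) != x * 8 || mulBy9(x) != x * 9) {
                throw new AssertionError("mismatch for x = " + x);
            }
        }
        System.out.println("all rewrites match");
    }
}
```

Because Java `int` arithmetic is modular, both forms agree with the multiplication even when the result overflows.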
C2 currently implements a few such patterns in the mid-end, including:
1. |C| = 1<<n (n>0)
2. |C| = (1<<n) - 1 (n>0)
3. |C| = (1<<m) + (1<<n) (m>n, n>=0)
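As an illustration of the three patterns listed above, the following Java sketch classifies a constant by which pattern its absolute value matches. It is not the C2 implementation: the class name, the return encoding, the order in which the patterns are tested, and the decision to treat 0 and ±1 as trivial are all assumptions of this sketch.

```java
final class MulConstPatterns {
    static boolean isPowerOfTwo(long v) {
        return v > 0 && (v & (v - 1)) == 0;
    }

    // Returns 1, 2, or 3 for the first matching pattern, or 0 if none matches.
    static int classify(int c) {
        // Widen to long so that |Integer.MIN_VALUE| is representable.
        long abs = Math.abs((long) c);
        if (abs <= 1) {
            return 0; // assumption: 0 and +/-1 are handled elsewhere as trivial cases
        }
        if (isPowerOfTwo(abs)) {
            return 1; // |C| = 1<<n, n > 0
        }
        if (isPowerOfTwo(abs + 1)) {
            return 2; // |C| = (1<<n) - 1
        }
        if (Long.bitCount(abs) == 2) {
            return 3; // |C| = (1<<m) + (1<<n), m > n >= 0
        }
        return 0;
    }
}
```

Some constants satisfy more than one pattern; for example, 3 is both (1<<2) - 1 and (1<<1) + (1<<0). The sketch reports the first match in list order, so `classify(3)` yields 2, while `classify(10)` yields 3 and `classify(11)` yields 0.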
The first two are fine, because on most architectures they are lowered to a single ADD/SUB/SHIFT instruction.
But the third pattern does not always perform well on some architectures, such as AArch64. According to the Arm optimization guide, if the shift amount is greater than 4, the latency and throughput of the ADD instruction are the same as those of the MUL instruction. In that case, converting MUL to ADD is not profitable. Hence, performing this transformation at the mid-end IR level may cause performance regressions in some cases.
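To make the concern concrete, here is a rough Java sketch of the third pattern's decomposition together with a hypothetical target-side check. The threshold of 4 is taken from the reading of the Arm optimization guide given above. The class, the method names, and the idea of requiring a single ADD-SHIFT form (n == 0) are assumptions for illustration only, not C2 code or an actual cost model.

```java
final class AddShiftCheck {
    // Shift amounts above this are assumed to make ADD-SHIFT no cheaper than MUL.
    static final int MAX_CHEAP_SHIFT = 4;

    // Splits a positive constant with exactly two set bits into {m, n}, m > n >= 0.
    static int[] shifts(int c) {
        if (c <= 0 || Integer.bitCount(c) != 2) {
            throw new IllegalArgumentException("expected (1<<m) + (1<<n): " + c);
        }
        int n = Integer.numberOfTrailingZeros(c);
        int m = 31 - Integer.numberOfLeadingZeros(c);
        return new int[] {m, n};
    }

    // x * C computed as (x << m) + (x << n).
    static int mulViaShifts(int x, int c) {
        int[] s = shifts(c);
        return (x << s[0]) + (x << s[1]);
    }

    // Hypothetical check: true when x * C fits one ADD-SHIFT, x + (x << m), with m <= 4.
    static boolean isCheapSingleAddShift(int c) {
        int[] s = shifts(c);
        return s[1] == 0 && s[0] <= MAX_CHEAP_SHIFT;
    }
}
```

Under these assumptions, C = 9 (m = 3, n = 0) passes the check, while C = 33 (m = 5, n = 0) does not, even though both match the third pattern. A mid-end rewrite that is unaware of the target cannot make this distinction.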