I noticed this while experimenting with JDK-8149758: C2_MacroAssembler::genmask emits mov64(temp, -1L), which is encoded in 10 bytes.
For immediate values that fit in 32 bits, we could use shorter encodings with movl and movq instead of movabs, similar to what MacroAssembler::movptr is doing.
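The selection described above can be sketched as follows. This is a minimal illustration, not HotSpot code: MovForm, fits_in_uint32, fits_in_int32, select_form and encoded_size are hypothetical names, and the byte counts assume the standard x86-64 encodings of the three forms with a register destination.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not HotSpot's Assembler API.
enum class MovForm { Movl, MovqImm32, Movabs };

// True when the value survives zero-extension of its low 32 bits.
constexpr bool fits_in_uint32(int64_t imm) {
  return static_cast<uint64_t>(imm) <= 0xFFFFFFFFull;
}

// True when the value survives sign-extension of its low 32 bits.
constexpr bool fits_in_int32(int64_t imm) {
  return imm >= INT32_MIN && imm <= INT32_MAX;
}

// Pick the shortest form that still produces the full 64-bit value.
constexpr MovForm select_form(int64_t imm) {
  return fits_in_uint32(imm) ? MovForm::Movl
       : fits_in_int32(imm)  ? MovForm::MovqImm32
                             : MovForm::Movabs;
}

// Encoded length in bytes; needs_rex_b is true when the destination is r8..r15.
constexpr int encoded_size(MovForm form, bool needs_rex_b) {
  return form == MovForm::Movl      ? (needs_rex_b ? 6 : 5)  // [REX] B8+rd imm32
       : form == MovForm::MovqImm32 ? 7                      // REX.W C7 /0 imm32
                                    : 10;                    // REX.W B8+rd imm64
}
```

Under these assumptions, 0x3ffffff and 0xffffffff select the movl form (5 or 6 bytes), 0xfffffffffffffffc (that is, -4 as a signed 64-bit value) and -1L select the 7-byte movq form, and anything wider stays on the 10-byte movabs.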
I have found only one location (in MacroAssembler::ic_call) that depends on mov64 being 10 bytes, and that call site can be excluded from the change.
In the Renaissance dotty benchmark, most of the savings come from the immediate values 0x3ffffff, 0xfffffffffffffffc and 0xffffffff, for a total of ~460 bytes saved:
java -jar ~/.cache/stress/renaissance-gpl-0.16.1.jar -r 1 dotty
During a JDK build with make, I saw ~1K of savings in one instance.
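While the numbers may seem small, the fix may improve the instruction cache hit rate and reduce decode pressure on the frontend. The proposed change does not alter semantics either: for an immediate that fits the shorter form, movl and movq leave the same 64-bit value in the destination register as movabs.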
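To illustrate why the resulting register value is the same, here is a small model of the three forms. It is a sketch under the usual x86-64 rules (a 32-bit register write zeroes the upper half; movq with a 32-bit immediate sign-extends it); the function names are hypothetical and exist only for this example.

```cpp
#include <cstdint>

// Register contents after movl r32, imm32: the upper 32 bits are zeroed.
constexpr uint64_t after_movl(uint32_t imm32) {
  return static_cast<uint64_t>(imm32);
}

// Register contents after movq r64, imm32: imm32 is sign-extended to 64 bits.
constexpr uint64_t after_movq_imm32(int32_t imm32) {
  return static_cast<uint64_t>(static_cast<int64_t>(imm32));
}

// Register contents after movabs r64, imm64: the immediate is loaded as-is.
constexpr uint64_t after_movabs(uint64_t imm64) {
  return imm64;
}
```

With this model, after_movl(0xffffffffu) equals after_movabs(0xffffffff), and after_movq_imm32(-4) equals after_movabs(0xfffffffffffffffc). It also shows why 0xffffffff has to take the movl form: the sign-extending movq would produce 0xffffffffffffffff instead.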
Relates to: JDK-8149758 Small arraycopy of non-constant length is slower than individual load/stores (Open)
Links to: Review(master) openjdk/jdk/30073