- Bug
- Resolution: Unresolved
- P3
- 24
- riscv
- linux
When running e.g. chi-square, a large performance regression can be seen on some hardware (in this case P550).
These Renaissance benchmarks are highly compiler dependent, meaning results can vary by 30% from run to run due to differences in the code cache (both due to profiling and due to placement of code).
One major factor is that pre-24 rv64 used trampoline calls:
##############
0x00007ff43025ee8c: jal ra,0x00007ff43025f16c // if the target was reachable we did a direct call here, otherwise via the trampoline
...
0x00007ff43025f16c: auipc t1,0x0 ; {trampoline_stub}
0x00007ff43025f170: ld t1,12(t1) # 0x00007ff43025f178
0x00007ff43025f174: jalr zero,0(t1)
0x00007ff43025f178: <8-byte address> // atomically patchable
#################
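Whether a direct call was possible depends on the reach of `jal`, whose immediate encodes a signed, 2-byte-aligned offset of roughly +/-1 MiB. The sketch below is only an illustration of that range check, not HotSpot code; the helper name `jal_reachable` is hypothetical. It uses the call site and stub addresses from the listing above.

```python
# Illustrative sketch only: jal_reachable is a hypothetical helper, not a HotSpot function.
# RISC-V jal encodes a signed 21-bit byte offset whose lowest bit is always zero.
JAL_MIN_OFFSET = -(1 << 20)      # -1 MiB
JAL_MAX_OFFSET = (1 << 20) - 2   # +1 MiB minus 2 bytes


def jal_reachable(pc: int, target: int) -> bool:
    """Return True if a jal located at pc can jump directly to target."""
    offset = target - pc
    return offset % 2 == 0 and JAL_MIN_OFFSET <= offset <= JAL_MAX_OFFSET


call_site = 0x00007FF43025EE8C   # address of the jal in the listing
stub = 0x00007FF43025F16C        # address of the trampoline stub in the listing

print(hex(stub - call_site))            # distance from call site to stub
print(jal_reachable(call_site, stub))   # the stub sits close to the call site
```

A callee further than about 1 MiB from the call site fails this check, which is the case where the call had to go through the trampoline.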
Due to issues with loading intra-cache and an unneeded jump, this was changed in "8332689: RISC-V: Use load instead of trampolines":
##################
0x00007ff3b4342c30: auipc t1,0x0
0x00007ff3b4342c34: ld t1,832(t1) # 0x00007ff3b4342f70
0x00007ff3b4342c38: jalr ra,0(t1)
...
0x00007ff3b4342f70: <8-byte address> // atomically patchable
#################
But this implementation has no direct calls, as they are rare in practice.
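The address of the patchable 8-byte slot in both listings follows from the `auipc`/`ld` pair: `auipc` adds its sign-extended upper immediate (shifted left by 12) to the pc, and `ld` adds its sign-extended 12-bit offset to that. A minimal sketch of the arithmetic, with hypothetical helper names (`sign_extend`, `pc_relative_slot`) that are not taken from the JDK sources:

```python
# Illustrative sketch only: helper names are hypothetical, not from the JDK sources.
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def pc_relative_slot(auipc_pc: int, upper_imm20: int, load_imm12: int) -> int:
    """Address read by `auipc rd, upper_imm20` followed by `ld rd, load_imm12(rd)`."""
    base = auipc_pc + (sign_extend(upper_imm20, 20) << 12)
    return (base + sign_extend(load_imm12, 12)) & MASK64


# Pre-24 trampoline stub: auipc t1,0x0 ; ld t1,12(t1)
print(hex(pc_relative_slot(0x00007FF43025F16C, 0x0, 12)))
# Load-based call site after 8332689: auipc t1,0x0 ; ld t1,832(t1)
print(hex(pc_relative_slot(0x00007FF3B4342C30, 0x0, 832)))
```

Both results match the addresses the disassembler annotates after `#` in the listings.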