-
Enhancement
-
Resolution: Unresolved
-
P4
-
None
-
21, 25, 26
Looking at Renaissance benchmarks, I notice that some benchmarks like scala-doku are significantly slower with Shenandoah in comparison with other collectors:
$ build/linux-aarch64-server-release/images/jdk/bin/java -jar ~/renaissance-jmh-0.16.0.jar ScalaDoku -wi 3 -i 3 -f 1 --jvmArgs "-Xmx8g -Xms8g -XX:+AlwaysPreTouch -XX:+UseParallelGC"
Benchmark Mode Cnt Score Error Units
JmhScalaDoku.run ss 3 2160.655 ± 364.043 ms/op
$ build/linux-aarch64-server-release/images/jdk/bin/java -jar ~/renaissance-jmh-0.16.0.jar ScalaDoku -wi 3 -i 3 -f 1 --jvmArgs "-Xmx8g -Xms8g -XX:+AlwaysPreTouch -XX:+UseShenandoahGC"
Benchmark Mode Cnt Score Error Units
JmhScalaDoku.run ss 3 3843.770 ± 740.348 ms/op
perfasm shows the hotspot is the "dmb ishld" in the nmethod entry barrier.
....[Hottest Region 2]..............................................................................
c2, scala.collection.immutable.SetIterator::next, version 1, compile id 893
0x0000ffffa8599008: nop
[Entry Point]
# {method} {0x0000fffe789eb008} 'next' '()Ljava/lang/Object;' in 'scala/collection/immutable/SetIterator'
# [sp+0x30] (sp of caller)
0x0000ffffa859900c: ldr w8, [x1, #8]
0x0000ffffa8599010: ldr w10, [x9, #8]
0x0000ffffa8599014: cmp w8, w10
╭ 0x0000ffffa8599018: b.eq 0x0000ffffa8599020 // b.none
│ 0x0000ffffa859901c: b 0x0000ffffa848ec60 ; {runtime_call Shared Runtime ic_miss_blob}
│ [Verified Entry Point]
0.19% ↘ 0x0000ffffa8599020: sub x9, sp, #0x14, lsl #12
0.09% 0x0000ffffa8599024: str xzr, [x9]
0.09% 0x0000ffffa8599028: sub sp, sp, #0x30
0.09% 0x0000ffffa859902c: stp x29, x30, [sp, #32]
0.09% 0x0000ffffa8599030: ldr w8, 0x0000ffffa8599174
0.07% 0x0000ffffa8599034: dmb ishld
7.22% 0x0000ffffa8599038: ldr w9, [x28, #32]
0.08% 0x0000ffffa859903c: cmp w8, w9
It makes sense that it affects some benchmarks that are not as deeply inlined. Erik did JDK-8290700, which ported a new way to sync up nmethod barriers, conc_instruction_and_data_patch, from the Generational ZGC repository into mainline. Generational ZGC has been using it since JDK 21. Switching Shenandoah to it like so:
diff --git a/src/hotspot/cpu/aarch64/gc/shenandoah/shenandoahBarrierSetAssembler_aarch64.hpp b/src/hotspot/cpu/aarch64/gc/shenandoah/shenandoahBarrierSetAssembler_aarch64.hpp
index a12d4e2beec..c89847b9d52 100644
--- a/src/hotspot/cpu/aarch64/gc/shenandoah/shenandoahBarrierSetAssembler_aarch64.hpp
+++ b/src/hotspot/cpu/aarch64/gc/shenandoah/shenandoahBarrierSetAssembler_aarch64.hpp
@@ -67,7 +67,7 @@ class ShenandoahBarrierSetAssembler: public BarrierSetAssembler {
Register scratch, RegSet saved_regs);
public:
- virtual NMethodPatchingType nmethod_patching_type() { return NMethodPatchingType::conc_data_patch; }
+ virtual NMethodPatchingType nmethod_patching_type() { return NMethodPatchingType::conc_instruction_and_data_patch; }
#ifdef COMPILER1
void gen_pre_barrier_stub(LIR_Assembler* ce, ShenandoahPreBarrierStub* stub);
...makes Shenandoah perform significantly better on this example workload:
Benchmark Mode Cnt Score Error Units
JmhScalaDoku.run ss 3 2616.273 ± 51.920 ms/op
We need to see what else should be done to support conc_instruction_and_data_patch in Shenandoah barriers.
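For a rough sense of scale, the three scores quoted above can be compared directly. This is a minimal sketch that uses only the numbers from this report; the variable and function names are illustrative. With three iterations per run and wide error bars, the ratios should be read as indicative only.

```python
# Single-shot scores (ms/op) copied from the JMH output in this report.
PARALLEL_MS = 2160.655             # -XX:+UseParallelGC
SHENANDOAH_BASELINE_MS = 3843.770  # -XX:+UseShenandoahGC, conc_data_patch
SHENANDOAH_PATCHED_MS = 2616.273   # -XX:+UseShenandoahGC, conc_instruction_and_data_patch


def slowdown(score_ms: float, reference_ms: float) -> float:
    """Return how many times slower score_ms is than reference_ms."""
    return score_ms / reference_ms


def reduction(before_ms: float, after_ms: float) -> float:
    """Return the fractional drop in time per operation from before to after."""
    return 1.0 - after_ms / before_ms


baseline_vs_parallel = slowdown(SHENANDOAH_BASELINE_MS, PARALLEL_MS)
patched_vs_parallel = slowdown(SHENANDOAH_PATCHED_MS, PARALLEL_MS)
patched_gain = reduction(SHENANDOAH_BASELINE_MS, SHENANDOAH_PATCHED_MS)

print(f"Shenandoah baseline vs Parallel: {baseline_vs_parallel:.2f}x")
print(f"Shenandoah patched vs Parallel:  {patched_vs_parallel:.2f}x")
print(f"Time per op reduced by:          {patched_gain:.1%}")
```

On these numbers the patched build closes most, but not all, of the gap to Parallel on this workload.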
- relates to
-
JDK-8290700 Optimize AArch64 nmethod entry barriers
-
- Resolved
-