Type: Enhancement
Resolution: Unresolved
Priority: P4
Affects Version(s): 11, 17, 18, 19
I believe I stumbled upon a benchmark that shows the current OptoLoopAlignment is not enough, at least on Zen 2:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(value = 1, jvmArgsPrepend = {"-Xmx1g", "-Xms1g", "-XX:+AlwaysPreTouch", "-XX:+UnlockExperimentalVMOptions", "-XX:+UseEpsilonGC"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class Peak {
    private int x;

    @Benchmark
    public int test() {
        return x;
    }
}
Yields:
# Run progress: 0.00% complete, ETA 00:00:23
# Fork: 1 of 1
# Warmup Iteration 1: 0.545 ns/op
# Warmup Iteration 2: 0.546 ns/op
# Warmup Iteration 3: 0.535 ns/op // <--- first C2 compilation happens here
Iteration 1: 0.534 ns/op
Iteration 2: 0.544 ns/op
Iteration 3: 0.533 ns/op
Iteration 4: 0.534 ns/op
Iteration 5: 0.533 ns/op
Iteration 6: 0.531 ns/op
Iteration 7: 0.416 ns/op // <--- C2 recompilation happens here
Iteration 8: 0.414 ns/op
Iteration 9: 0.415 ns/op
Iteration 10: 0.414 ns/op
Iteration 11: 0.431 ns/op
Iteration 12: 0.413 ns/op
Iteration 13: 0.413 ns/op
...
I managed to dump the generated assembly in both cases, and it is identical instruction-wise. The hot loop is just:
↗ 0x00007ff260ad6c70: mov 0xc(%rbx),%r11d
2.06% │ 0x00007ff260ad6c74: movzbl 0x94(%rcx),%r8d
2.95% │ 0x00007ff260ad6c7c: mov 0x340(%r15),%r11
1.26% │ 0x00007ff260ad6c83: add $0x1,%r13
0.10% │ 0x00007ff260ad6c87: test %eax,(%r11)
90.45% │ 0x00007ff260ad6c8a: test %r8d,%r8d
0.45% ╰ 0x00007ff260ad6c8d: je 0x00007ff260ad6c70
What seems to differ is the starting address of the loop. Hypothesis: the second recompilation drops something from the method prolog, like the clinit barrier, which accidentally puts the alignment "right" in many cases.
Perfnorm data shows the fast case runs at 4.38 insns/clk and the slow case at 3.48 insns/clk, executing exactly the same number of instructions per iteration in ~1.6 and ~2.0 cycles, respectively. This points to some micro-architectural effect in play.
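As a sanity check on these numbers (a toy calculation, assuming the 7-instruction loop body dumped above is all that retires per iteration), dividing the instruction count by the observed IPC reproduces the per-iteration cycle counts:

```java
public class IpcCheck {
    public static void main(String[] args) {
        int loopInsns = 7;                     // instructions in the hot loop dumped above
        double fastIpc = 4.38, slowIpc = 3.48; // from perfnorm
        // cycles per iteration = instructions / IPC
        System.out.printf("fast: %.2f cycles/iter%n", loopInsns / fastIpc); // ~1.60
        System.out.printf("slow: %.2f cycles/iter%n", loopInsns / slowIpc); // ~2.01
    }
}
```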
I put this hacky block in MachPrologNode::emit and reran many times with random AddNopsAtEntry values (you cannot just call os::random() there, because the node size would vary and the emitter would complain):
if (AddNopsAtEntry > 0) {
  __ nop(AddNopsAtEntry); // pad the method entry, shifting everything after it
}
The performance went all over the place, in line with the theory that accidental loop alignment matters:
$ java -jar target/multirun.jar
ns/op, Err
0.42, 0.03
0.44, 0.07
0.42, 0.01
0.42, 0.05
0.54, 0.02 // <--- hiccup
0.41, 0.00
0.41, 0.00
0.43, 0.05
0.43, 0.06
0.41, 0.01
0.41, 0.00
0.43, 0.02
0.41, 0.00
0.48, 0.37 // <--- hiccup
0.53, 0.02 // <--- hiccup
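For intuition on why a handful of entry nops can flip the result (a sketch: loopStart is taken from the dump above, and which fetch/decode boundary matters on Zen 2 is exactly the open question), note that each extra padding byte shifts the loop start's offset within the 16/32/64-byte windows:

```java
public class AlignmentSketch {
    public static void main(String[] args) {
        long loopStart = 0x7ff260ad6c70L; // loop start address from the dump above
        for (int nops = 0; nops <= 6; nops++) {
            // entry padding shifts the loop start by the same number of bytes
            long s = loopStart + nops;
            System.out.printf("nops=%d -> start mod 16/32/64 = %d/%d/%d%n",
                    nops, s % 16, s % 32, s % 64);
        }
    }
}
```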
Then I took JDK-8281467, which allows for larger alignments, and with it the benchmark performed consistently:
$ java -XX:OptoLoopAlignment=32 -jar target/multirun.jar
ns/op, Err
0.42, 0.01
0.43, 0.02
0.41, 0.01
0.41, 0.01
0.44, 0.02
0.42, 0.04
0.41, 0.00
0.43, 0.03
0.42, 0.01
0.41, 0.00
0.41, 0.00
0.43, 0.04
0.43, 0.03
0.45, 0.11
0.43, 0.05
0.42, 0.02
SPECjvm2008 runs do not show visible performance regressions with alignment of 16, 32, or 64, so this might be a very subtle effect.
relates to: JDK-8281467 Allow larger OptoLoopAlignment and CodeEntryAlignment (Resolved)