Type: Bug
Resolution: Unresolved
Priority: P4
Affects Version/s: 11, 17, 21, 22, 23
I have been investigating the source of a very large run-to-run variance in the Renaissance scala-kmeans test. This spans multiple releases, going back to at least JDK 11.
The swings are quite large:
```
$ java -Xmx4g -Xms4g -XX:+AlwaysPreTouch -XX:LoopUnrollLimit=1 -XX:-UseCompressedOops -jar renaissance-jmh-0.15.0.jar ScalaKmeans -f 200 -wi 5 -i 5
...
# Run progress: 16.00% complete, ETA 00:08:33
# Fork: 33 of 200
# Warmup Iteration 1: 248.163 ms/op
# Warmup Iteration 2: 168.640 ms/op
# Warmup Iteration 3: 167.313 ms/op
# Warmup Iteration 4: 167.920 ms/op
# Warmup Iteration 5: 156.700 ms/op
Iteration 1: 156.609 ms/op
Iteration 2: 161.525 ms/op
Iteration 3: 158.755 ms/op
Iteration 4: 157.219 ms/op
Iteration 5: 157.120 ms/op
# Run progress: 16.50% complete, ETA 00:08:30
# Fork: 34 of 200
# Warmup Iteration 1: 278.384 ms/op
# Warmup Iteration 2: 236.705 ms/op
# Warmup Iteration 3: 230.085 ms/op
# Warmup Iteration 4: 226.939 ms/op
# Warmup Iteration 5: 226.689 ms/op
Iteration 1: 222.489 ms/op
Iteration 2: 223.541 ms/op
Iteration 3: 222.218 ms/op
Iteration 4: 222.187 ms/op
Iteration 5: 225.900 ms/op
# Run progress: 17.00% complete, ETA 00:08:29
# Fork: 35 of 200
# Warmup Iteration 1: 246.660 ms/op
# Warmup Iteration 2: 167.399 ms/op
# Warmup Iteration 3: 167.279 ms/op
# Warmup Iteration 4: 166.729 ms/op
# Warmup Iteration 5: 157.692 ms/op
Iteration 1: 157.223 ms/op
Iteration 2: 161.327 ms/op
Iteration 3: 160.988 ms/op
Iteration 4: 157.841 ms/op
Iteration 5: 157.769 ms/op
...
```
This shows up in out-of-the-box mode as well; I added a few options to make disassembly simpler.
Perfasm shows considerable differences between the C2-generated code in the fast and slow cases. The full perfasm outputs have multiple things going on, but if we extract the actual kernels from them (fast-kernel.txt, slow-kernel.txt), it becomes clearer that in the slow case we end up reading twice as many things from some (lambda-captured?) object.
This is why I suspect there is some C2 performance oddity at play here.
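To make the suspicion concrete, here is a minimal Scala sketch of the shape I have in mind. This is not the actual scala-kmeans code; the `Config` class, its fields, and `distanceSum` are made-up names for illustration. The point is that when the hot loop lives inside a closure, the field reads go through the captured object, and whether C2 hoists those loads out of the loop (or re-reads them every iteration, as the slow kernel appears to do) depends on how it compiles the captured-object accesses.
```
// Illustrative sketch only; not the scala-kmeans kernel.
final class Config(val xs: Array[Double], val ys: Array[Double])

object CaptureSketch {
  def distanceSum(cfg: Config): Double = {
    // The lambda captures `cfg`; each field access inside it goes
    // through the captured reference. If C2 does not hoist these
    // loads, every iteration re-reads them from memory.
    val f: Int => Double = i => {
      val dx = cfg.xs(i) // load through captured object
      val dy = cfg.ys(i) // load through captured object
      dx * dx + dy * dy
    }
    var sum = 0.0
    var i = 0
    while (i < cfg.xs.length) { sum += f(i); i += 1 }
    sum
  }

  def main(args: Array[String]): Unit = {
    val n = 1 << 20
    val cfg = new Config(Array.tabulate(n)(_.toDouble), Array.tabulate(n)(_.toDouble))
    println(distanceSum(cfg))
  }
}
```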