- Type: Enhancement
- Resolution: Fixed
- Priority: P4
- Fix Version: 23
- Resolved In Build: b04

If you consider this contrived example:
```
% cat ManyThreadsStacks.java
public class ManyThreadsStacks {
    static final int THREADS = 1024;
    static final int DEPTH = 1024;
    static volatile Object sink;

    public static void main(String... args) {
        // Start THREADS threads, each parking DEPTH frames deep in work().
        for (int t = 0; t < THREADS; t++) {
            new Thread(() -> work(DEPTH)).start();
        }
        // Allocate continuously so the GC keeps running cycles.
        while (true) {
            sink = new byte[100_000];
        }
    }

    public static void work(int depth) {
        if (depth > 0) {
            work(depth - 1);
        }
        // Sleep at the bottom of the recursion, keeping all work() frames on the stack.
        while (true) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```
...then the concurrent mark times for both Shenandoah and ZGC are unusually long for the amount of work involved:
```
% build/linux-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseShenandoahGC -Xlog:gc ManyThreadsStacks.java 2>&1 | grep "marking roots"
[0.846s][info][gc] GC(0) Concurrent marking roots 142.135ms
[1.010s][info][gc] GC(1) Concurrent marking roots 146.222ms
[1.174s][info][gc] GC(2) Concurrent marking roots 150.447ms
[1.330s][info][gc] GC(3) Concurrent marking roots 144.910ms
```
```
% build/linux-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseZGC -Xlog:gc -Xlog:gc+phases ManyThreadsStacks.java | grep "Concurrent Mark"
[1.000s][info][gc,phases] GC(0) Concurrent Mark 154.720ms
[1.001s][info][gc,phases] GC(0) Concurrent Mark Free 0.001ms
[1.187s][info][gc,phases] GC(1) Concurrent Mark 157.702ms
[1.187s][info][gc,phases] GC(1) Concurrent Mark Free 0.001ms
[1.394s][info][gc,phases] GC(2) Concurrent Mark 148.263ms
[1.394s][info][gc,phases] GC(2) Concurrent Mark Free 0.001ms
[1.557s][info][gc,phases] GC(3) Concurrent Mark 153.175ms
[1.558s][info][gc,phases] GC(3) Concurrent Mark Free 0.001ms
```
Profiles show that most of this time is spent acquiring the per-nmethod locks here:
- Shenandoah: https://github.com/openjdk/jdk/blob/c6f611cfe0f3d6807b450be19ec00713229dbf42/src/hotspot/share/gc/shenandoah/shenandoahBarrierSetNMethod.cpp#L41
- ZGC (non-generational, `gc/x`): https://github.com/openjdk/jdk/blob/c6f611cfe0f3d6807b450be19ec00713229dbf42/src/hotspot/share/gc/x/xBarrierSetNMethod.cpp#L35
- ZGC (generational, `gc/z`): https://github.com/openjdk/jdk/blob/c6f611cfe0f3d6807b450be19ec00713229dbf42/src/hotspot/share/gc/z/zBarrierSetNMethod.cpp#L40
...which makes sense, since all the threads are in the same nmethod and therefore contend on the same lock.
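A rough Java model makes the contention pattern visible. This is a sketch under stated assumptions, not HotSpot code: `LockingEntryBarrierModel`, its `armed` flag, the `ReentrantLock` standing in for the per-nmethod lock, and the counters are all hypothetical. The lock is taken on every pass through the barrier, before the armed check, so once the first pass has disarmed the nmethod, every later pass still serializes on the same lock while doing no useful work.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of an nmethod entry barrier that always locks first.
// Not HotSpot code: names and counters are illustrative only.
final class LockingEntryBarrierModel {
    private final ReentrantLock nmethodLock = new ReentrantLock(); // stand-in for the per-nmethod lock
    private boolean armed = true;                                   // stand-in for is_armed(); guarded by nmethodLock
    private int lockAcquisitions;                                   // guarded by nmethodLock
    private int fixups;                                             // guarded by nmethodLock

    // Called on every pass through the barrier.
    void entryBarrier() {
        nmethodLock.lock(); // taken unconditionally: this is the contended step
        try {
            lockAcquisitions++;
            if (armed) {
                fixups++;      // placeholder for the real per-nmethod GC work
                armed = false; // disarm so later passes have nothing to do
            }
        } finally {
            nmethodLock.unlock();
        }
    }

    int lockAcquisitions() {
        nmethodLock.lock();
        try {
            return lockAcquisitions;
        } finally {
            nmethodLock.unlock();
        }
    }

    int fixups() {
        nmethodLock.lock();
        try {
            return fixups;
        } finally {
            nmethodLock.unlock();
        }
    }

    // Runs passesPerThread barrier passes on each of threadCount threads.
    static LockingEntryBarrierModel hammer(int threadCount, int passesPerThread) {
        LockingEntryBarrierModel nm = new LockingEntryBarrierModel();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < passesPerThread; i++) {
                    nm.entryBarrier();
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return nm;
    }

    public static void main(String[] args) {
        LockingEntryBarrierModel nm = hammer(8, 128);
        // Prints "1024 lock acquisitions, 1 fixup(s)": every pass locked, only the first did work.
        System.out.println(nm.lockAcquisitions() + " lock acquisitions, " + nm.fixups() + " fixup(s)");
    }
}
```

With 8 threads making 128 passes each, the model takes the lock 1024 times for a single fixup; the lock, not the work, dominates.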
The question is: can we use double-checked locking here, by doing the `is_armed` check before acquiring the lock? At least for Shenandoah, this improves the timings considerably:
```
% build/linux-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseShenandoahGC -Xlog:gc ManyThreadsStacks.java 2>&1 | grep "marking roots"
[0.697s][info][gc] GC(0) Concurrent marking roots 3.914ms
[0.716s][info][gc] GC(1) Concurrent marking roots 3.896ms
[0.737s][info][gc] GC(2) Concurrent marking roots 3.896ms
[0.757s][info][gc] GC(3) Concurrent marking roots 3.876ms
[0.776s][info][gc] GC(4) Concurrent marking roots 3.908ms
```
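The proposed double-checked variant can be sketched with a similar hypothetical model (again not HotSpot code; class, field, and method names are illustrative). The `armed` flag is read once without the lock and, only if it is still set, read again under the lock, so after the first disarm nearly all passes return without touching the lock. In this Java model the flag must be `volatile` so the unlocked read observes the disarm, and the fixup work is done before `armed` is cleared so the volatile write publishes it. The real barrier has further, instruction-level concerns that this model does not capture, such as the cross-modifying fence discussed in the related JDK-8334890.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the proposed double-checked nmethod entry barrier.
// Not HotSpot code: names and counters are illustrative only.
final class DoubleCheckedEntryBarrierModel {
    private final ReentrantLock nmethodLock = new ReentrantLock(); // stand-in for the per-nmethod lock
    private volatile boolean armed = true;                          // stand-in for is_armed(); volatile for the unlocked read
    private int lockAcquisitions;                                   // guarded by nmethodLock
    private int fixups;                                             // guarded by nmethodLock

    // Called on every pass through the barrier.
    void entryBarrier() {
        // First check, without the lock: once disarmed, this is the common fast path.
        if (!armed) {
            return;
        }
        nmethodLock.lock();
        try {
            lockAcquisitions++;
            // Second check, under the lock: another thread may have disarmed
            // the nmethod while this one was waiting for the lock.
            if (armed) {
                fixups++;      // placeholder for the real per-nmethod GC work
                armed = false; // volatile write publishes the fixup to unlocked readers
            }
        } finally {
            nmethodLock.unlock();
        }
    }

    int lockAcquisitions() {
        nmethodLock.lock();
        try {
            return lockAcquisitions;
        } finally {
            nmethodLock.unlock();
        }
    }

    int fixups() {
        nmethodLock.lock();
        try {
            return fixups;
        } finally {
            nmethodLock.unlock();
        }
    }

    // Runs passesPerThread barrier passes on each of threadCount threads.
    static DoubleCheckedEntryBarrierModel hammer(int threadCount, int passesPerThread) {
        DoubleCheckedEntryBarrierModel nm = new DoubleCheckedEntryBarrierModel();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < passesPerThread; i++) {
                    nm.entryBarrier();
                }
            });
            threads[t].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return nm;
    }

    public static void main(String[] args) {
        DoubleCheckedEntryBarrierModel nm = hammer(8, 128);
        // fixups is always 1; each thread takes the lock at most once, so lockAcquisitions <= 8.
        System.out.println(nm.lockAcquisitions() + " lock acquisitions, " + nm.fixups() + " fixup(s)");
    }
}
```

Sequential passes take the lock exactly once. Under concurrency, a few threads may pass the first check before the disarm becomes visible, so `main` can report more than one lock acquisition (at most one per thread), but still exactly one fixup.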
Relates to:
- JDK-8333716 Shenandoah: Check for disarmed method before taking the nmethod lock (Resolved)
- JDK-8334890 Missing unconditional cross modifying fence in nmethod entry barriers (Resolved)