- Type: Enhancement
- Resolution: Fixed
- Priority: P3
- Fix Versions: 17, 21, 22
- Resolved In Build: b26
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build
---|---|---|---|---|---|---
JDK-8330642 | 21.0.4 | Aleksey Shipilev | P3 | Resolved | Fixed | b01
JDK-8333244 | 17.0.13 | Aleksey Shipilev | P3 | Resolved | Fixed | b01
While running simple benchmarks for safepoints, I was surprised to see impressively bad performance on my Mac M1 with a simple workload like this:
```
public class LotsRunnable {
    static final int THREAD_COUNT = Integer.getInteger("threads", Runtime.getRuntime().availableProcessors() * 4);

    static Object sink;

    public static void main(String... args) throws Exception {
        for (int c = 0; c < THREAD_COUNT; c++) {
            Thread t = new Thread(() -> {
                while (true) {
                    Thread.onSpinWait();
                }
            });
            t.setDaemon(true);
            t.start();
        }
        System.out.println("Started");
        long stop = System.nanoTime() + 10_000_000_000L;
        while (System.nanoTime() < stop) {
            sink = new byte[100_000];
        }
    }
}
```
If you run with -Xlog:safepoint -Xlog:gc, you will notice that the GC pause times and the actual VM operation times are completely out of whack. For example:
```
$ java -Xlog:safepoint -Xlog:gc -Xmx2g LotsRunnable.java
[3.188s][info][gc ] GC(19) Pause Young (Normal) (G1 Evacuation Pause) 308M->2M(514M) 0.878ms
[3.326s][info][safepoint] Safepoint "G1CollectForAllocation", Time since last: 4963375 ns, Reaching safepoint: 349292 ns, Cleanup: 2000 ns, At safepoint: 138700375 ns, Total: 139051667 ns
```
Note how the GC pause is under 1 ms, but "At safepoint" is a whole 138 ms (!!!). Nearly the entire 139 ms total is spent at the safepoint, not reaching it (~0.35 ms) or in cleanup (2 µs).
Deeper profiling shows that the problem is on the path where we wake up the threads from the safepoint:
https://github.com/openjdk/jdk/blob/4f9f1955ab2737880158c57d4891d90e2fd2f5d7/src/hotspot/share/runtime/safepoint.cpp#L494-L495
JDK-8214271 ("Fast primitive to wake many threads") added the WaitBarrier that serves this path. Before that change, in JDK 11, performance was okay, which makes this a regression between JDK 11 and JDK 17.
WaitBarrier has two implementations: one for Linux that uses futexes, and a generic one that uses semaphores. For implementation reasons, the generic version has to wait for all threads to leave the barrier before it unblocks from disarm(). This means every thread currently blocked at the safepoint must roll out of wait() before we unblock from the safepoint! This effectively runs into the same problem as time-to-safepoint (TTSP), only worse: all those threads are blocked and need to be woken up, scheduled, etc.
This is not what the Linux futex-based implementation does: it just notifies the futex and leaves.
While the unblocked threads do start to execute, so we are not completely blocked waiting for disarm(), this still:
a) trips the safepoint timings;
b) delays any further actions of VMThread;
c) delays resuming GC from STS, as `Universe::heap()->safepoint_synchronize_end()` comes after this;
d) places a limit on the safepoint frequency we can have;
e) maybe something else I cannot see right away;
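The semantic difference can be illustrated with a toy model. This is plain Java, not HotSpot code; `ToyWaitBarrier` and its members are hypothetical names for illustration. The point is that a semaphore-based disarm() must spin until every waiter has rolled out of await(), which is exactly the wake-up-and-schedule latency charged to "At safepoint" above:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the generic (semaphore-based) WaitBarrier semantics.
// NOT HotSpot code; names are made up for illustration.
public class ToyWaitBarrier {
    volatile boolean armed = true;
    final Semaphore sem = new Semaphore(0);
    final AtomicInteger inBarrier = new AtomicInteger();

    // Waiters park on the semaphore while the barrier is armed.
    void await() throws InterruptedException {
        inBarrier.incrementAndGet();
        while (armed) {
            sem.acquire();
        }
        inBarrier.decrementAndGet(); // generic disarm() waits for this
    }

    // Generic-style disarm: wake everyone, then spin until every waiter
    // has left await(), so the semaphore can be reused on the next cycle.
    // A futex-style disarm() would just publish "disarmed", wake the
    // waiters, and return immediately.
    void disarmAndWaitForRollout() {
        armed = false;
        sem.release(1_000_000); // wake all currently parked waiters
        while (inBarrier.get() > 0) {
            Thread.onSpinWait(); // the "VMThread" stalls here
        }
    }

    public static void main(String[] args) throws Exception {
        ToyWaitBarrier barrier = new ToyWaitBarrier();
        int n = 8;
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                try { barrier.await(); } catch (InterruptedException ignored) {}
            });
            t.setDaemon(true);
            t.start();
        }
        Thread.sleep(200); // crude: give waiters time to park
        long start = System.nanoTime();
        barrier.disarmAndWaitForRollout();
        long waited = System.nanoTime() - start;
        // disarm only returned after all n waiters were scheduled and left
        System.out.println("rollout of " + n + " waiters took " + waited + " ns");
    }
}
```

The time spent in the final spin loop scales with how quickly the OS can schedule all the woken threads, which is why oversubscribed machines (like the 4x-threads benchmark above) show the worst numbers.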
The safepoint-end code is intended to be fast precisely to avoid such surprises. To that end, I think we can improve GenericWaitBarrier to avoid most of this performance cliff.
WIP: https://github.com/openjdk/jdk/pull/16404
- backported by
  - JDK-8330642 Improve GenericWaitBarrier performance (Resolved)
  - JDK-8333244 Improve GenericWaitBarrier performance (Resolved)
- relates to
  - JDK-8214271 Fast primitive to wake many threads (Resolved)
- links to
  - Commit openjdk/jdk17u-dev/515bc9a2
  - Commit openjdk/jdk21u-dev/6c5500bb
  - Commit openjdk/jdk/30462f9d
  - Review openjdk/jdk17u-dev/2041
  - Review openjdk/jdk21u-dev/70
  - Review openjdk/jdk/16404