JDK-8318986

Improve GenericWaitBarrier performance

    • b26

        While running simple benchmarks for safepoints, I was surprised to see impressively bad performance on my Mac M1 with a simple workload like this:

        ```
        public class LotsRunnable {
           static final int THREAD_COUNT = Integer.getInteger("threads", Runtime.getRuntime().availableProcessors() * 4);
           static Object sink;

           public static void main(String... args) throws Exception {
             for (int c = 0; c < THREAD_COUNT; c++) {
               Thread t = new Thread(() -> {
                 while (true) {
                    Thread.onSpinWait();
                 }
               });
               t.setDaemon(true);
               t.start();
             }

             System.out.println("Started");

             long stop = System.nanoTime() + 10_000_000_000L;
             while (System.nanoTime() < stop) {
               sink = new byte[100_000];
             }
           }
        }
        ```

        If you run with -Xlog:safepoint -Xlog:gc, you will notice that the GC pause times and the actual VM operation times are completely out of whack. For example:

        ```
        $ java -Xlog:safepoint -Xlog:gc -Xmx2g LotsRunnable.java
        [3.188s][info][gc ] GC(19) Pause Young (Normal) (G1 Evacuation Pause) 308M->2M(514M) 0.878ms
        [3.326s][info][safepoint] Safepoint "G1CollectForAllocation", Time since last: 4963375 ns, Reaching safepoint: 349292 ns, Cleanup: 2000 ns, At safepoint: 138700375 ns, Total: 139051667 ns
        ```
        Note how the pause is <1 ms, but the "At safepoint" time is a whole 138 ms (!!!).
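
        To put numbers on that gap, here is a small sketch that pulls the nanosecond fields out of the safepoint line above and compares them with the 0.878 ms GC pause. The class and method names (`SafepointLogMath`, `fieldNs`) are made up for illustration, and the regex only assumes the `Label: N ns` shape visible in this one log line:

        ```java
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class SafepointLogMath {
            // Safepoint line copied from the log above.
            static final String LINE = "[3.326s][info][safepoint] Safepoint \"G1CollectForAllocation\", "
                + "Time since last: 4963375 ns, Reaching safepoint: 349292 ns, Cleanup: 2000 ns, "
                + "At safepoint: 138700375 ns, Total: 139051667 ns";

            // GC pause reported by -Xlog:gc for the same collection, in milliseconds.
            static final double GC_PAUSE_MS = 0.878;

            /** Returns the nanosecond value that follows "label: " in the line (hypothetical helper). */
            static long fieldNs(String line, String label) {
                Matcher m = Pattern.compile(Pattern.quote(label) + ": (\\d+) ns").matcher(line);
                if (!m.find()) {
                    throw new IllegalArgumentException("no such field: " + label);
                }
                return Long.parseLong(m.group(1));
            }

            public static void main(String[] args) {
                long reaching = fieldNs(LINE, "Reaching safepoint");
                long cleanup = fieldNs(LINE, "Cleanup");
                long at = fieldNs(LINE, "At safepoint");
                long total = fieldNs(LINE, "Total");

                double atMs = at / 1_000_000.0;
                System.out.printf("At safepoint: %.3f ms, GC pause: %.3f ms%n", atMs, GC_PAUSE_MS);
                // Time spent "at safepoint" that the GC pause itself does not explain.
                System.out.printf("Not explained by the pause: %.3f ms%n", atMs - GC_PAUSE_MS);
                System.out.println("Phases add up to Total: " + (reaching + cleanup + at == total));
            }
        }
        ```

        For this line the three phases do add up to the reported Total, and "At safepoint" exceeds the GC pause by roughly 137.8 ms.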

        Deeper profiling shows that the problem is on the path where we wake up the threads from the safepoint:
         https://github.com/openjdk/jdk/blob/4f9f1955ab2737880158c57d4891d90e2fd2f5d7/src/hotspot/share/runtime/safepoint.cpp#L494-L495

        JDK-8214271 ("Fast primitive to wake many threads") added the WaitBarrier to serve on that path. Before that, in JDK 11, the performance is okay. This makes it a regression between JDK 11 and JDK 17.

        WaitBarrier has two implementations: one for Linux that uses futexes, and a generic one that uses semaphores. For implementation reasons, the generic version has to wait for all threads to leave the barrier before it unblocks from disarm(). This means that all threads currently blocked for the safepoint need to roll out of wait() before we unblock from the safepoint! This effectively runs into the same problem as TTSP (time to safepoint), only worse: all those threads are blocked and need to be woken up, scheduled, etc.

        This is not what the Linux futex-based implementation does: it just notifies the futex and leaves.
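
        The difference between the two disarm() strategies can be modeled in plain Java. This is only a simplified sketch, not the HotSpot C++ code: the class names (`DrainingBarrier`, `NotifyOnlyBarrier`), the counters, and the use of `java.util.concurrent.Semaphore` and a monitor in place of a futex are all assumptions made for illustration, and barrier tags and re-arming are ignored.

        ```java
        import java.util.concurrent.Semaphore;
        import java.util.concurrent.atomic.AtomicInteger;

        public class BarrierSketch {
            /** Semaphore-based model: disarm() returns only after every waiter has left await(). */
            static final class DrainingBarrier {
                private final Semaphore sem = new Semaphore(0);
                private final AtomicInteger inside = new AtomicInteger();   // threads anywhere in await()
                private final AtomicInteger waiters = new AtomicInteger();  // threads still owed a permit
                private volatile boolean armed;

                void arm() { armed = true; }

                void await() {
                    inside.incrementAndGet();
                    if (armed) {
                        waiters.incrementAndGet();
                        sem.acquireUninterruptibly();
                    }
                    inside.decrementAndGet();
                }

                void disarm() {
                    armed = false;
                    do {
                        int w = waiters.getAndSet(0);   // claim the waiters seen so far
                        if (w > 0) sem.release(w);      // hand out one permit per claimed waiter
                        Thread.onSpinWait();
                    } while (inside.get() > 0);         // the costly part: wait for everyone to leave
                }

                int inside() { return inside.get(); }
            }

            /** Monitor-based model of "notify and leave": disarm() wakes waiters and returns at once. */
            static final class NotifyOnlyBarrier {
                private final Object lock = new Object();
                private boolean armed;

                void arm() { synchronized (lock) { armed = true; } }

                void await() throws InterruptedException {
                    synchronized (lock) {
                        while (armed) lock.wait();
                    }
                }

                void disarm() {
                    synchronized (lock) {
                        armed = false;
                        lock.notifyAll();   // no waiting for the woken threads to get scheduled
                    }
                }
            }

            /** Blocks n threads on a DrainingBarrier, disarms, and reports how many are still inside. */
            static int drainingDemo(int n) throws InterruptedException {
                DrainingBarrier b = new DrainingBarrier();
                b.arm();
                Thread[] ts = new Thread[n];
                for (int i = 0; i < n; i++) {
                    ts[i] = new Thread(b::await);
                    ts[i].start();
                }
                while (b.inside() < n) Thread.onSpinWait();  // wait until all threads entered await()
                b.disarm();
                int stillInside = b.inside();  // disarm() only returns once this has dropped to zero
                for (Thread t : ts) t.join();
                return stillInside;
            }

            public static void main(String[] args) throws InterruptedException {
                System.out.println("threads inside after disarm: " + drainingDemo(8));
            }
        }
        ```

        In `DrainingBarrier`, the loop at the end of disarm() is where the disarming thread sits while every blocked thread gets woken up and scheduled; `NotifyOnlyBarrier.disarm()` has no such loop.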

        While the unblocked threads do start executing, so we are not completely stalled while waiting for disarm(), this definitely:
         a) throws off the safepoint timings;
         b) delays any further actions of VMThread;
         c) delays resuming GC from STS, as `Universe::heap()->safepoint_synchronize_end()` comes after this;
         d) places a limit on the safepoint frequency we can have;
         e) maybe does something else I cannot see right away.

        I think the intent for the safepoint end code is to be fast to avoid any of these surprises. To that end, I think we can improve GenericWaitBarrier to avoid most of the performance cliff.

        WIP: https://github.com/openjdk/jdk/pull/16404

              shade Aleksey Shipilev