Investigation: GC request deadlocks when holding a pinned object

    • gc

      Requesting a GC while inside a JNI critical section can deadlock. For example, after JDK-8192647, the VM GC operation prologue waits until all Java threads have left their JNI critical regions before proceeding. However, that prologue is run by the Java thread that requests the GC, either explicitly (e.g. System.gc()) or because an allocation failure leads to a GC. So that thread has to wait for all Java threads to leave their JNI critical regions, which is awkward if the requesting Java thread is in a JNI critical region *itself*: it will never leave, since it is stuck in the checking loop, so the deadlock is guaranteed.
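
      The self-wait described above can be illustrated with a toy Java model. This is only a sketch under stated assumptions: the class and method names (CriticalRegionModel, enterCritical, exitCritical, requestGc) are hypothetical and are not HotSpot APIs, and the timeout exists only so the demo can report a would-be hang instead of hanging.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Toy model only: none of these names exist in HotSpot.
class CriticalRegionModel {
    // Number of modeled JNI critical regions that are currently open.
    private final AtomicInteger openCriticalRegions = new AtomicInteger();

    void enterCritical() { openCriticalRegions.incrementAndGet(); }
    void exitCritical()  { openCriticalRegions.decrementAndGet(); }

    // Models the GC prologue run by the requesting thread: wait until no
    // critical region is open. The wait described in the report has no
    // timeout; the timeout here is an assumption that lets the demo return
    // false where the real thread would wait forever.
    boolean requestGc(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (openCriticalRegions.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // would have been stuck in the checking loop
            }
            LockSupport.parkNanos(1_000_000L); // back off for about 1 ms
        }
        return true;
    }

    public static void main(String[] args) {
        CriticalRegionModel vm = new CriticalRegionModel();
        System.out.println("GC outside critical region proceeds: " + vm.requestGc(100));

        vm.enterCritical();
        // The only thread that could call exitCritical() is the one waiting.
        System.out.println("GC inside critical region proceeds: " + vm.requestGc(100));
        vm.exitCritical();
    }
}
```

      In this model the second request can never succeed, because the waiting thread is the only one able to close the region it is waiting on.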

      I believe this realistically stops any further GCs from happening, as the other Java threads reach the same block one by one when they run out of allocatable memory. For a while, though, it will look as if only a few threads are stuck and no GC progress is being made. We seem to have caught an issue like that in Lucene workloads running on JDK 25; we are still confirming whether this is the root cause. I found this issue by reading the JDK-8192647 code, as that was our primary suspect. But the issue itself is not very new, and would deserve some more verification: JDK-8375209.

      JNI Spec Chapter 4 says:
      "Inside a “critical region” the native code should not run for an indefinite period of time, must not invoke arbitrary JNI functions, and must not perform operations that might cause the current thread to block and wait for another thread in the virtual machine. Given these restrictions, the virtual machine can temporarily disable garbage collection while giving the native code direct access to array elements."

      So arguably there is _some_ leeway for deadlocking when something special happens in the JNI critical region. Implementations can still make better choices when it comes to corner cases. A simple reproducer, adapted from the Shenandoah JNI stress tests:
       https://github.com/openjdk/jdk/compare/master...shipilev:jdk:JDK-8375188-gc-deadlock

      Run it, and it is guaranteed to hit the asserts in current mainline:

      $ make test TEST=gc/jni
      ...

      # Internal Error (/home/shade/trunks/jdk/src/hotspot/share/gc/shared/gcLocker.cpp:102), pid=4157666, tid=4157698
      # assert(!java_thread->in_critical_atomic()) failed: About to deadlock...

      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      V [libjvm.so+0x1000adb] GCLocker::block()+0x61b (gcLocker.cpp:102)
      V [libjvm.so+0x1016923] VM_GC_Operation::doit_prologue()+0xc3 (gcVMOperations.cpp:125)
      V [libjvm.so+0x1ecbfcc] VMThread::execute(VM_Operation*)+0x7c (vmThread.cpp:541)
      V [libjvm.so+0x187f7df] ParallelScavengeHeap::collect(GCCause::Cause)+0x8f (parallelScavengeHeap.cpp:567)
      V [libjvm.so+0x1305222] JVM_GC+0x192 (jvm.cpp:454)
      j java.lang.Runtime.gc()V+0 java.base@27-internal
      j java.lang.System.gc()V+3 java.base@27-internal
      j TestPinnedDeadlock.main([Ljava/lang/String;)V+9

      ...or it will get stuck without the assert.
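
      The assert in the log above checks whether the requesting thread is itself in a critical region before it blocks. A minimal Java sketch of that kind of fail-fast check follows. All names here (SelfCheckingGcRequest, criticalDepth, checkBeforeBlocking) are hypothetical and chosen for illustration; this is not the HotSpot implementation, only a model of the idea.

```java
// Toy model only: names are hypothetical, not HotSpot APIs.
class SelfCheckingGcRequest {
    // Per-thread nesting depth of modeled JNI critical regions.
    private static final ThreadLocal<Integer> criticalDepth =
            ThreadLocal.withInitial(() -> 0);

    static void enterCritical() { criticalDepth.set(criticalDepth.get() + 1); }
    static void exitCritical()  { criticalDepth.set(criticalDepth.get() - 1); }

    // Fails fast if the calling thread is inside a critical region, rather
    // than letting it wait on a condition that only it could clear.
    static void checkBeforeBlocking() {
        if (criticalDepth.get() > 0) {
            throw new IllegalStateException(
                    "About to deadlock: GC requested inside a critical region");
        }
    }

    public static void main(String[] args) {
        checkBeforeBlocking(); // passes: not in a critical region

        enterCritical();
        try {
            checkBeforeBlocking(); // throws: this thread holds the region
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        } finally {
            exitCritical();
        }
    }
}
```

      A check like this only detects the condition; what to do instead of blocking is a separate, GC-specific decision.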

      Per GC status:
       Serial: 17 PASSES; 21, 25, mainline FAILS -- technically a regression in JDK 21
       Parallel: 17 PASSES; 21, 25, mainline FAILS -- technically a regression in JDK 21
       G1: 17 FAILS; 21, 25, mainline PASSES -- excellent, improves in JDK 21
       Shenandoah: 17, 21, 25, mainline PASSES
       ZGC: 17, 21, 25, mainline FAILS
       Epsilon: 17, 21, 25, mainline PASSES

      If we want to go for GC-specific fixes, those should probably be split into separate related tasks.

            Assignee:
            Unassigned
            Reporter:
            Aleksey Shipilev