Type: Task
Resolution: Unresolved
Priority: P3
Affects Version/s: 21, 25
Component/s: hotspot
Taking a GC while in a JNI critical section can deadlock. For example, after JDK-8192647, the VM GC operation prologue waits until all Java threads leave their JNI critical regions before proceeding. However, that prologue is run by the Java thread that requests the GC, either through an explicit request (e.g. System.gc()) or because of an allocation failure. So that thread has to wait for all Java threads to leave their JNI critical regions, which is awkward if the requesting Java thread is in a JNI critical region *itself*. It will never leave, since it is stuck in the checking loop, so the deadlock is guaranteed.
I believe that this realistically stops any further GCs from happening, as the other Java threads arrive at the same block one by one as they run out of allocatable memory. For a while, though, it will look as if only a few threads are stuck and no GC progress is being made. We seem to have caught an issue like that in Lucene workloads running on JDK 25; we are still confirming whether this is the root cause. I found this issue by reading the JDK-8192647 code, as that was our primary suspect. But the issue itself is not very new, and would deserve some more verification: JDK-8375209.
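To make the self-wait concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions: ToyGCLocker and all of its methods are hypothetical names invented for illustration, they do not mirror HotSpot's GCLocker code, and the wait loop is bounded by maxSpins so the example terminates, whereas the wait described above is unbounded.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: hypothetical names, not HotSpot's GCLocker implementation.
public class ToyGCLocker {
    // Number of threads currently inside a simulated JNI critical region.
    private final AtomicInteger inCritical = new AtomicInteger();
    // Nesting depth of simulated critical regions for the current thread.
    private final ThreadLocal<Integer> depth = ThreadLocal.withInitial(() -> 0);

    public void enterCritical() {
        depth.set(depth.get() + 1);
        inCritical.incrementAndGet();
    }

    public void exitCritical() {
        if (depth.get() == 0) {
            throw new IllegalStateException("not in a critical region");
        }
        depth.set(depth.get() - 1);
        inCritical.decrementAndGet();
    }

    // True if the calling thread is itself in a critical region,
    // i.e. a GC request from this thread would wait on itself.
    public boolean wouldSelfDeadlock() {
        return depth.get() > 0;
    }

    // Simulated GC prologue: wait until no thread is in a critical region.
    // Returns true if the GC could proceed, false if the spin budget ran out.
    public boolean blockForGC(int maxSpins) {
        for (int i = 0; i < maxSpins; i++) {
            if (inCritical.get() == 0) {
                return true;
            }
            Thread.onSpinWait();
        }
        return false;
    }

    public static void main(String[] args) {
        ToyGCLocker locker = new ToyGCLocker();
        System.out.println("idle, GC proceeds: " + locker.blockForGC(1000));
        locker.enterCritical();
        System.out.println("would self-deadlock: " + locker.wouldSelfDeadlock());
        System.out.println("in critical, GC proceeds: " + locker.blockForGC(1000));
        locker.exitCritical();
        System.out.println("after exit, GC proceeds: " + locker.blockForGC(1000));
    }
}
```

In this model the requesting thread can never observe a zero count while it holds its own critical region, which is the condition the assert in the stack trace further down checks for.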
JNI Spec Chapter 4 says:
"Inside a “critical region” the native code should not run for an indefinite period of time, must not invoke arbitrary JNI functions, and must not perform operations that might cause the current thread to block and wait for another thread in the virtual machine. Given these restrictions, the virtual machine can temporarily disable garbage collection while giving the native code direct access to array elements."
So arguably there is _some_ leeway for deadlocking when something special happens in the JNI critical region. Implementations can still make better choices when it comes to corner cases. Simple reproducer, adapted from Shenandoah JNI stress tests:
https://github.com/openjdk/jdk/compare/master...shipilev:jdk:JDK-8375188-gc-deadlock
Run it, and it is guaranteed to hit the asserts in current mainline:
$ make test TEST=gc/jni
...
# Internal Error (/home/shade/trunks/jdk/src/hotspot/share/gc/shared/gcLocker.cpp:102), pid=4157666, tid=4157698
# assert(!java_thread->in_critical_atomic()) failed: About to deadlock...
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x1000adb] GCLocker::block()+0x61b (gcLocker.cpp:102)
V [libjvm.so+0x1016923] VM_GC_Operation::doit_prologue()+0xc3 (gcVMOperations.cpp:125)
V [libjvm.so+0x1ecbfcc] VMThread::execute(VM_Operation*)+0x7c (vmThread.cpp:541)
V [libjvm.so+0x187f7df] ParallelScavengeHeap::collect(GCCause::Cause)+0x8f (parallelScavengeHeap.cpp:567)
V [libjvm.so+0x1305222] JVM_GC+0x192 (jvm.cpp:454)
j java.lang.Runtime.gc()V+0 java.base@27-internal
j java.lang.System.gc()V+3 java.base@27-internal
j TestPinnedDeadlock.main([Ljava/lang/String;)V+9
...or it will get stuck without the assert.
Per GC status:
Serial: 17 PASSES; 21, 25, mainline FAILS -- technically a regression in JDK 21
Parallel: 17 PASSES; 21, 25, mainline FAILS -- technically a regression in JDK 21
G1: 17 FAILS; 21, 25, mainline PASSES -- excellent, improves in JDK 21
Shenandoah: 17, 21, 25, mainline PASSES
ZGC: 17, 21, 25, mainline FAILS
Epsilon: 17, 21, 25, mainline PASSES
If we want to go for GC-specific fixes, those should probably be split into separate related tasks.
relates to:
JDK-8192647 GClocker induced GCs can starve threads requiring memory leading to OOME (Resolved)
JDK-8375209 Strengthen Xcheck:jni for JNI critical regions (Open)