Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-2097578 | 5.0 | Y. Ramakrishna | P2 | Resolved | Fixed | b36 |
There is a bug in the communication mechanism between the CMS
thread and the SurrogateLockerThread (SLT) which ultimately results in a
deadlock between the VMThread, the CMS thread and the SLT. I am
looking for feedback and possibly a fix for this problem from the CMS
developers at Sun.
The root of the problem is that the implementation of Monitor::wait can
block at SafepointSynchronize::block before waiting on the condition
variable associated with the monitor, keeping the mutex of the critical
region held. Normally this isn't a problem since any synchronization
associated with the Monitor can be postponed until after the GC. However
if the monitor itself is part of the garbage collection mechanism, then
it is a problem. This has been the nature of the CMS problem that I've
been debugging. A monitor that falls into this category is SLT_lock.
Below is the complete sequence of events that leads to a deadlock between
the CMS thread (background GC), the SLT (SurrogateLockerThread) and the
foreground GC (VMThread). The CMS thread and the SLT synchronize
using the monitor SLT_lock; if the CMS thread is unable to proceed, then
the foreground GC keeps waiting, causing the system to hang.
----
1. Java threads initiate safepoint synchronization.
2. Meanwhile CMS thread is executing
ConcurrentMarkSweepThread::manipulatePLL which does:
a) SLT_lock.lock() followed by
b) SLT_lock.notify() indicating a "message" has been sent to the SLT
thread.
c) It then waits using SLT_lock.wait( no_safepoint ).
3. An already waiting SLT thread (a Java thread) is woken (in function
SurrogateLockerThread::loop) by the notification from (2b) and then
carries out the action associated with the "message".
4. The SLT thread then does
a) SLT_lock.lock() which it is able to acquire because (2c) released it.
b) Does SLT_lock.notify() to resume the CMS thread waiting at (2c).
c) Does a safepointed wait (since it is a Java thread) using
SLT_lock.wait(). However, because safepoint synchronization was already
initiated, the code blocks at SafepointSynchronize::block. The wait on
the condition variable will be executed only after returning from the
safepoint; in the meantime the mutex of the critical region is kept held.
This behavior is implemented in Monitor::wait.
5. The CMS thread waiting from (2c) receives the condition variable
signal from (4b), resumes execution, and attempts to reacquire the SLT_lock
mutex (all this within a pthread_cond_wait or equivalent call), which is
the normal monitor behavior. It cannot acquire the lock and blocks,
since the SLT thread is waiting at a safepoint while holding
the SLT_lock mutex. The CMS thread is unable to proceed with its
execution.
6. The system is at a safepoint and the VM thread starts executing the
foreground GC. The foreground CMS collection algorithm requires the
background thread to signal that it is "okay to switchover" from background GC
to foreground GC. Because the CMS thread is still stuck in
SLT_lock.wait from (5), the foreground collector keeps waiting.
This results in a deadlock!
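
The lock ordering in steps 4 and 5 can be sketched in isolation. The following is a minimal model using standard C++ threads, not HotSpot code: slt_lock, safepoint_in_progress, slt_blocked and cms_acquires_lock are hypothetical names, the spin loop is only a stand-in for SafepointSynchronize::block, and the release_before_block ordering is included purely for contrast (it is not claimed to be the fix that went into b36).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Returns true if the "CMS" side could take the lock while the "SLT" side
// was blocked for a simulated safepoint.
// release_before_block == false models step 4c: the SLT blocks for the
// safepoint while still holding the mutex.
bool cms_acquires_lock(bool release_before_block) {
    const std::chrono::milliseconds tick(1);

    std::timed_mutex slt_lock;                      // stand-in for the SLT_lock mutex
    std::atomic<bool> safepoint_in_progress{true};  // step 1: already requested
    std::atomic<bool> slt_blocked{false};

    std::thread slt([&] {
        slt_lock.lock();                            // step 4a
        if (release_before_block) {
            slt_lock.unlock();                      // contrast case: drop mutex first
        }
        slt_blocked = true;
        while (safepoint_in_progress) {             // stand-in for the safepoint block
            std::this_thread::sleep_for(tick);
        }
        if (!release_before_block) {
            slt_lock.unlock();                      // released only after the safepoint
        }
    });

    // Wait until the SLT side has reached its simulated safepoint block.
    while (!slt_blocked) {
        std::this_thread::sleep_for(tick);
    }

    // Step 5: the CMS thread tries to reacquire the mutex on return from wait.
    bool acquired = slt_lock.try_lock_for(std::chrono::milliseconds(100));
    if (acquired) {
        slt_lock.unlock();
    }

    // In the real hang the safepoint never ends, because the VM thread is
    // waiting on the CMS thread. The sketch ends it so that it terminates.
    safepoint_in_progress = false;
    slt.join();
    return acquired;
}
```

With the ordering from step 4c, cms_acquires_lock(false) returns false: the simulated CMS thread times out on the mutex for as long as the safepoint lasts. With the contrast ordering, cms_acquires_lock(true) is expected to return true, since the mutex is free while the SLT side is parked.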
atg hang with b32 in C1 mode using CMS collector after 1 hour 16 minutes. The hang could be an instance of bug 4962516.
The test machine is jtg-linux4.sfbay
###@###.### 2003-12-22
Stack trace from atg hang:
Thread 60 (Thread 1100073776 (LWP 13089)):
#0 0xffffe002 in ?? ()
#1 0x4003a5d5 in pthread_cond_wait@@GLIBC_2.3.2 ()
from /lib/tls/libpthread.so.0
#2 0x402ef9e9 in os::Linux::safe_cond_wait(pthread_cond_t*,
pthread_mutex_t*)
() from /usr/j2se/jre/lib/i386/client/libjvm.so
#3 0x402dce31 in Monitor::wait(int, long) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#4 0x4015a23d in
ConcurrentMarkSweepThread::manipulatePLL(SurrogateLockerThread::SLT_msg_type)
() from /usr/j2se/jre/lib/i386/client/libjvm.so
#5 0x4014d8ec in CMSCollector::collect_in_background(int) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#6 0x4015952a in ConcurrentMarkSweepThread::run() ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#7 0x402f0704 in _start(Thread*) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#8 0x40038484 in start_thread () from /lib/tls/libpthread.so.0
Thread 59 (Thread 1100122928 (LWP 13090)):
#0 0xffffe002 in ?? ()
#1 0x402ef9e9 in os::Linux::safe_cond_wait(pthread_cond_t*,
pthread_mutex_t*)
() from /usr/j2se/jre/lib/i386/client/libjvm.so
#2 0x402dce31 in Monitor::wait(int, long) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#3 0x4014d13d in CMSCollector::acquire_control_and_collect(int, int) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#4 0x4014ceb4 in ConcurrentMarkSweepGeneration::collect(int, int,
unsigned, int, int) () from /usr/j2se/jre/lib/i386/client/libjvm.so
#5 0x4017802f in GenCollectedHeap::do_collection(int, int, unsigned,
int, int, int, int*) () from /usr/j2se/jre/lib/i386/client/libjvm.so
#6 0x40138d31 in
TwoGenerationCollectorPolicy::satisfy_failed_allocation(unsigned, int,
int, int*) () from /usr/j2se/jre/lib/i386/client/libjvm.so
#7 0x40178292 in GenCollectedHeap::satisfy_failed_allocation(unsigned,
int, int, int*) () from /usr/j2se/jre/lib/i386/client/libjvm.so
#8 0x4037c784 in VM_GenCollectForAllocation::doit() ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#9 0x4037c4c6 in VM_Operation::evaluate() ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#10 0x4037bb37 in VMThread::evaluate_operation(VM_Operation*) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#11 0x4037bd45 in VMThread::loop() ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#12 0x4037b950 in VMThread::run() ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#13 0x402f0704 in _start(Thread*) ()
from /usr/j2se/jre/lib/i386/client/libjvm.so
#14 0x40038484 in start_thread () from /lib/tls/libpthread.so.0
###@###.### 2003-12-22
- backported by: JDK-2097578 CMS thread/SLT deadlock problem (Resolved)
- duplicates: JDK-4972454 CMS: RH9 linux: b32: atg crash if -Xmx12m (Closed)