JDK-4962516

CMS thread/SLT deadlock problem

    • gc
    • 06
    • generic, x86, sparc
    • generic, linux_redhat_9.0, solaris_2.6, solaris_8

        There is a bug in the communication mechanism between the CMS
        thread and the SurrogateLockerThread (SLT) that ultimately results in a
        deadlock between the VMThread, the CMS thread and the SLT. I am
        looking for feedback and possibly a fix for this problem from the CMS
        developers at Sun.

        The root of the problem is that the implementation of Monitor::wait can
        block at SafepointSynchronize::block before waiting on the condition
        variable associated with the monitor, while still holding the mutex of
        the critical region. Normally this isn't a problem, since any
        synchronization associated with the Monitor can be postponed until
        after the GC. However, if the monitor itself is part of the garbage
        collection mechanism, then it is a problem. This has been the nature of
        the CMS problem that I've been debugging. A monitor that falls into this
        category is SLT_lock.

        Below is the complete sequence of events that leads to deadlock between
        the CMS thread (background GC), the SLT (surrogate locker thread) and
        the foreground GC (VMThread). The CMS thread and the SLT synchronize
        using the monitor SLT_lock; if the CMS thread is unable to proceed, the
        foreground GC keeps waiting, causing the system to hang.
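
        Before the step-by-step sequence, the core hazard can be shown in
        isolation: one thread parks at a safepoint while it still holds a
        mutex that a second thread needs. The following is a minimal sketch
        using standard C++ threads, not HotSpot code. The names
        simulate_safepoint_wait_hazard, HazardResult, safepoint_active and
        slt_holds_lock are hypothetical, and the sketch ends the "safepoint"
        itself so that it terminates, which the real system cannot do.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical result type for this illustration only.
struct HazardResult {
    bool acquired_during_safepoint;  // could the "CMS" side get the lock while the "SLT" was parked?
    bool acquired_after_safepoint;   // could it get the lock once the safepoint ended?
};

// Assumption: a plain timed mutex stands in for the SLT_lock mutex and an
// atomic flag stands in for the safepoint state. Nothing here is a HotSpot API.
HazardResult simulate_safepoint_wait_hazard(std::chrono::milliseconds attempt) {
    assert(attempt.count() > 0);

    std::timed_mutex slt_lock;
    std::atomic<bool> safepoint_active{true};   // safepoint already requested
    std::atomic<bool> slt_holds_lock{false};

    // Stand-in for the SLT: takes the lock, then parks at the safepoint
    // before reaching the condition wait that would have released the lock.
    std::thread slt([&] {
        std::lock_guard<std::timed_mutex> guard(slt_lock);
        slt_holds_lock.store(true);
        while (safepoint_active.load()) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });

    // Wait until the stand-in SLT really owns the lock.
    while (!slt_holds_lock.load()) {
        std::this_thread::yield();
    }

    HazardResult r{};

    // Stand-in for the CMS thread: it needs the mutex back, but the owner
    // is parked at the safepoint, so a bounded attempt times out.
    r.acquired_during_safepoint = slt_lock.try_lock_for(attempt);
    if (r.acquired_during_safepoint) {
        slt_lock.unlock();
    }

    // Only the sketch can do this; in the reported hang the safepoint
    // never ends because the VMThread is itself waiting on the CMS thread.
    safepoint_active.store(false);
    slt.join();

    // The owner has exited and released the mutex, so a blocking lock returns.
    slt_lock.lock();
    r.acquired_after_safepoint = true;
    slt_lock.unlock();
    return r;
}
```

        Calling simulate_safepoint_wait_hazard(std::chrono::milliseconds(50))
        reports that the lock could not be taken while the owner was parked
        and could be taken once the simulated safepoint was lifted.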

        ----

        1. Java threads initiate safepoint synchronization.

        2. Meanwhile, the CMS thread is executing
        ConcurrentMarkSweepThread::manipulatePLL, which does:
        a) SLT_lock.lock(), followed by
        b) SLT_lock.notify(), indicating that a "message" has been sent to the
        SLT.
        c) It then waits using SLT_lock.wait( no_safepoint ).

        3. An already waiting SLT (a Java thread) is woken up in
        SurrogateLockerThread::loop by the notification from (2b) and then
        carries out the action associated with the "message".

        4. The SLT then:
        a) Does SLT_lock.lock(), which it is able to acquire because (2c)
        released it.
        b) Does SLT_lock.notify() to resume the CMS thread waiting at (2c).
        c) Does a safepointed wait (since it is a Java thread) using
        SLT_lock.wait(). However, because safepoint synchronization was already
        initiated, the code blocks at SafepointSynchronize::block. The wait on
        the condition variable will be executed only after returning from the
        safepoint, so the mutex of the critical region is kept held. This
        behavior is implemented in Monitor::wait.

        5. The CMS thread waiting from (2c) receives the condition variable
        signal from (4b), resumes execution, and attempts to reacquire the
        SLT_lock mutex (all this within a pthread_cond_wait or equivalent
        call), which is the normal monitor behavior. It cannot acquire the lock
        and blocks, since the SLT is waiting at a safepoint while holding
        the SLT_lock mutex. The CMS thread is unable to proceed with its
        execution.

        6. The system is at a safepoint and the VM thread starts executing the
        foreground GC. The foreground CMS collection algorithm requires the
        background thread to signal that it is "okay to switchover" from
        background GC to foreground GC. Because the CMS thread is still stuck
        in SLT_lock.wait in (5), the foreground collector keeps waiting.

        This results in a deadlock!
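
        The circular dependency in steps 4 to 6 can be written down as a
        small wait-for graph. This is only an illustrative sketch in standard
        C++: the names WaitForGraph, build_wait_for_graph and has_cycle_from
        are hypothetical, and the edge from the SLT to the VMThread encodes
        my reading that the safepoint ends only once the VMThread finishes
        its operation.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Each entry means "key cannot proceed until value proceeds".
// Every thread in this scenario waits on exactly one other thread.
using WaitForGraph = std::map<std::string, std::string>;

// Hypothetical encoding of the sequence above.
WaitForGraph build_wait_for_graph() {
    return {
        {"VMThread", "CMS thread"},  // step 6: needs the "okay to switchover" signal
        {"CMS thread", "SLT"},       // step 5: needs the SLT_lock mutex held by the SLT
        {"SLT", "VMThread"},         // step 4c: parked until the safepoint ends
    };
}

// Follows the single outgoing edge from each node, starting at `start`.
// Returns true if some node is visited twice, which means a cycle exists.
bool has_cycle_from(const WaitForGraph& g, const std::string& start) {
    assert(g.count(start) == 1 && "start must have an outgoing edge");
    std::set<std::string> seen;
    std::string cur = start;
    while (true) {
        if (!seen.insert(cur).second) {
            return true;   // already visited: the chain loops back
        }
        auto it = g.find(cur);
        if (it == g.end()) {
            return false;  // reached a thread that waits on nothing
        }
        cur = it->second;
    }
}
```

        With all three edges present, has_cycle_from(build_wait_for_graph(),
        "VMThread") finds the cycle. Removing any one edge, for example the
        SLT's dependence on the safepoint ending, breaks it.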


        atg hung with b32 in C1 mode using the CMS collector after 1 hour 16 minutes. The hang could be an instance of bug 4962516.
        The test machine is jtg-linux4.sfbay.

        ###@###.### 2003-12-22

        Stack trace from atg hang:
        Thread 60 (Thread 1100073776 (LWP 13089)):
        #0 0xffffe002 in ?? ()
        #1 0x4003a5d5 in pthread_cond_wait@@GLIBC_2.3.2 ()
            from /lib/tls/libpthread.so.0
        #2 0x402ef9e9 in os::Linux::safe_cond_wait(pthread_cond_t*,
        pthread_mutex_t*)
             () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #3 0x402dce31 in Monitor::wait(int, long) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #4 0x4015a23d in
        ConcurrentMarkSweepThread::manipulatePLL(SurrogateLockerThread::SLT_msg_type)
        () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #5 0x4014d8ec in CMSCollector::collect_in_background(int) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #6 0x4015952a in ConcurrentMarkSweepThread::run() ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #7 0x402f0704 in _start(Thread*) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #8 0x40038484 in start_thread () from /lib/tls/libpthread.so.0


        Thread 59 (Thread 1100122928 (LWP 13090)):
        #0 0xffffe002 in ?? ()
        #1 0x402ef9e9 in os::Linux::safe_cond_wait(pthread_cond_t*,
        pthread_mutex_t*)
             () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #2 0x402dce31 in Monitor::wait(int, long) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #3 0x4014d13d in CMSCollector::acquire_control_and_collect(int, int) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #4 0x4014ceb4 in ConcurrentMarkSweepGeneration::collect(int, int,
        unsigned, int, int) () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #5 0x4017802f in GenCollectedHeap::do_collection(int, int, unsigned,
        int, int, int, int*) () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #6 0x40138d31 in
        TwoGenerationCollectorPolicy::satisfy_failed_allocation(unsigned, int,
        int, int*) () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #7 0x40178292 in GenCollectedHeap::satisfy_failed_allocation(unsigned,
        int, int, int*) () from /usr/j2se/jre/lib/i386/client/libjvm.so
        #8 0x4037c784 in VM_GenCollectForAllocation::doit() ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #9 0x4037c4c6 in VM_Operation::evaluate() ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #10 0x4037bb37 in VMThread::evaluate_operation(VM_Operation*) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #11 0x4037bd45 in VMThread::loop() ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #12 0x4037b950 in VMThread::run() ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #13 0x402f0704 in _start(Thread*) ()
            from /usr/j2se/jre/lib/i386/client/libjvm.so
        #14 0x40038484 in start_thread () from /lib/tls/libpthread.so.0


        ###@###.### 2003-12-22

              ysr Y. Ramakrishna
              ksoshals Kirill Soshalskiy (Inactive)