Customer Problem Description:
------------------------------
After 3+ hours of execution, our Java application's VM is hanging. We cannot even signal the VM to do a thread dump. We are able to consistently reproduce this problem by running our application for an extended period of time under a considerable load. This problem is detrimental to the stability of our application and is a top priority for us.
To diagnose the problem, we downloaded the JDK 1.3.0 source and debugged it. The problem is a deadlock in the safepoint synchronization code. At the time of the deadlock, SafepointSynchronize::_waiting_to_block has a value greater than zero, while all threads appear to be at their respective safepoints.
As you know, the compiled safepoint handler copies the currently executing method and replaces almost all calls out of it with illegal instructions, which are later trapped to stop the thread. If the compiled code makes a JVM call, the thread's state changes from _thread_in_native_trans to _thread_in_native_blocked. If the thread's state is _thread_in_native_blocked when SafepointSynchronize::examine_state_of_thread() is called, roll_forward(_at_safepoint) is called. This causes the main thread to decrement _waiting_to_block on behalf of the running thread, because it is assumed to already be at a safepoint. However, in our case, the initial state of the thread is _thread_in_native. The main thread expects the running thread to decrement _waiting_to_block once it enters a safepoint. But once our thread reaches the safepoint, it is in _thread_in_native_blocked. In this case SafepointSynchronize::block() does not decrement _waiting_to_block. Therefore, the main thread blocks indefinitely waiting for _waiting_to_block to reach 0.
The application consists of four Java processes:
1. A 100% Java, home-grown message server. This server communicates with
the other components using TCP connections. This process does NOT
experience the deadlock problem.
2. An application server process. This consists of 47 threads. Each
thread is a subcomponent of the system. Each subcomponent requests data
objects from the message server, performs an action on the data, and
outputs processed data objects into the message server. The
subcomponents form a 6 stage pipeline in which work is performed. Some
stages have multiple threads, and there are some application management
threads, so that's how we wind up with 47 of them. This process DOES
experience the deadlock and does so more frequently than #3.
3. A Cloudscape database server process. This process DOES experience
the deadlock.
4. A Tomcat web server process. This process DOES NOT experience the
deadlock.
We have tested the application against JDK 1.3.1 and we experience the
deadlock there as well.
Here's some text/explanations I've been putting together for my own sanity. I've included some trace diagnostics. You'll notice some fields that don't look familiar; the most useful of these is _decremented_block.
This is a per-thread variable that gets set whenever a thread decrements
_waiting_to_block. SafepointSynchronize::begin() sets this value to 0
for all threads before the still_waiting loop. If the _waiting_to_block
is decremented on behalf of a thread by roll_forward(_at_safepoint),
_decremented_block is set to 2. If the decrement is done by
SafepointSynchronize::block(), _decremented_block is set to 1. If
_decremented_block is 0, the decrement has not been done.
-- attached diagnostic data --
Situation: Compiled code handler
Hang occurred after 3.5 hours of operation running Emissary. Trace log
had the following status repeated:
Threads_lock timeout: 0x1a24a870 [0x42c] _waiting_to_block=1
block_signaled=1 is_synchronizing=1 count=1
THE HANG occurred
num_threads = 30 init_waiting=30 stage2_wtb=2 _waiting_to_block=1
Thread: 0x1a1d42a0 [0x408] State: _compiled_handler ThreadState: 50
pc: 0x98d7e3 _decremented_block=0
SafepointHandler: CompiledCodeSafepointHandler
Thread: 0x1a1e2898 [0x3f8] State: _at_safepoint ThreadState: 11 pc:
0x0 _decremented_block=2
Thread: 0x1a1cbc58 [0x728] State: _at_safepoint ThreadState: 4 pc: 0x0
_decremented_block=2
Thread: 0x1a1c6420 [0x460] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x78b488 [0x868] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x78a818 [0x504] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a1aec80 [0x718] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1adea8 [0x970] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1ad3d8 [0x63c] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1ac990 [0x208] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1bc4b8 [0x210] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1ebca0 [0x588] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x8312c8 [0x628] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a24a590 [0x498] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a24a408 [0x6ec] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x786a60 [0x68c] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x7868f0 [0x540] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a24a870 [0x42c] State: _call_back ThreadState: 7 pc: 0x0
_decremented_block=1
Thread: 0x7888a8 [0x3b4] State: _at_safepoint ThreadState: 50 pc: 0x0
_decremented_block=2
Thread: 0x786df8 [0x450] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1afb48a8 [0x6b0] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x787618 [0x5c8] State: _at_safepoint ThreadState: 11 pc: 0x0
_decremented_block=2
Thread: 0x835da0 [0x704] State: _at_safepoint ThreadState: 11 pc: 0x0
_decremented_block=2
Thread: 0x1a1a3710 [0x880] State: _at_safepoint ThreadState: 50 pc:
0x0 _decremented_block=2
Thread: 0x1a16ed70 [0x3a8] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a2380b8 [0x1f8] State: _at_safepoint ThreadState: 11 pc:
0x0 _decremented_block=2
Thread: 0x8122f0 [0x1e4] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x80f920 [0x144] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a0b0c60 [0x720] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x7844c0 [0x590] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread 0x42c printed out a timeout message, as part of fix 1. It
appears to be unrelated, since thread 0x408 is the one that did not
decrement _waiting_to_block.
Thread 0x408 has the following stack dump:
Reported by Debugger Interpretation
NTDLL! 77f827e8()
KERNEL32! 77e86a3d()
JVM! 08047031() SafepointSynchronize::block() safepoint.cpp line 322
JVM! 0802713a() JNI_GetPrimitiveArrayCritical jni.cpp line 1451
Java_java_util_zip_CRC32_updateBytes(const JNINativeInterface_ * *
0x1a1d42f4, _jobject * 0x1cfafa6c, long 0, _jobject * 0x1cfafa80, long
0, long 4088) line 38 + 18 bytes
0x408's current state is 50 (_thread_in_native_blocked). The particular
section of SafepointSynchronize::block() in which the thread is blocked
is structured as follows:
switch (state) {
  // Code deleted
  case _thread_in_native_trans:
  case _thread_blocked_trans:
  case _thread_new_trans:
    // Code and comments deleted
    if (state == _thread_in_native_trans)
      thread->set_thread_state(_thread_in_native_blocked);
    // Block until the safepoint operation is completed.
    Threads_lock->lock_without_safepoint_check();
The thread's state could not have been _thread_in_native_blocked when
the function was entered, since it would not have reached this section
of the switch statement. Furthermore, the only way it could have changed
to _thread_in_native_blocked is if its value was _thread_in_native_trans.
This state is consistent with the transition code on the stack below it.
Therefore, this thread's state must have been _thread_in_native_trans
when the function was entered.
The thread's safepoint state is recorded as _compiled_handler. This
means that its state was _thread_in_comp_leaf or _thread_in_Java when
examine_state_of_thread() was called. The _compiled_handler state
expects the thread to decrement _waiting_to_block itself. However, when
SafepointSynchronize::block() is entered, the thread's state causes a
jump to a block of code that is normally reached when
examine_state_of_thread() has already called roll_forward(_at_safepoint).
Since roll_forward() decrements _waiting_to_block when the safepoint
state is set to _at_safepoint, block() does not decrement it again.
_decremented_block is 0, meaning nothing has decremented
_waiting_to_block on this thread's behalf.
-- end of diagnostic data --
- duplicates
  - JDK-4513101 VMThread stuck in SafepointSynchronize::begin() during stress profiling (Closed)
- relates to
  - JDK-4641232 Slow case for ArrayCopyStub allows method to escape compiled code safepoint (Resolved)
  - JDK-4681171 fatal in safepoint code, core build running mtrt on windows (Closed)