Type: Bug
Resolution: Fixed
Priority: P4
Fix Version/s: hs11
Affects Version/s: 5.0u12, 6, 6u1, 6u3
Component/s: hotspot
Labels:
- licbug
- webbug

Subcomponent:
runtime
Resolved In Build:
b01
CPU:

x86, sparc
OS:

linux, linux_redhat_4.0, solaris_8, solaris_10
Verification:
Verified

Issue	Fix Version	Assignee	Priority	Status	Resolution	Resolved In Build
JDK-2176972	7	Xiaobin Lu	P3	Closed	Fixed	b15
JDK-2150326	6u4	Kevin Walls	P4	Closed	Fixed	b01
JDK-2150325	5.0u14	Kevin Walls	P4	Closed	Fixed	b01

FULL PRODUCT VERSION :
Hotspot/Java:

- 1.6.0 b105
- sources:
jdk-6-fcs-bin-b105-jrl-29_nov_2006.jar
jdk-6-fcs-src-b105-jrl-29_nov_2006.jar
- build options: STATIC_MOTIF=false

FULL OS VERSION :
- uname: Linux b1c1s9 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006
x86_64 x86_64 x86_64 GNU/Linux
- RHEL 4, (patch level 4)
- 2x dual-core Intel Xeon CPUs (shows as 8-way machine)

A DESCRIPTION OF THE PROBLEM :
The problem shows up as relatively rare, random application pauses of
7-30 seconds. Typically, these occur once every 1-4 hours in
production. With application pause time tracking enabled, the pauses
are easy to see in the output logs as "application stopped" time. During
these pauses, a full CPU is consumed in kernel mode.
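
For anyone scanning logs for the same symptom, the following is a minimal sketch of a filter for long "application stopped" entries. It assumes the log was produced with -XX:+PrintGCApplicationStoppedTime and that each entry contains the text "Total time for which application threads were stopped: <seconds> seconds"; the function names and the 7-second default threshold (taken from the low end of the pause range above) are illustrative only.

```cpp
#include <exception>
#include <string>

// Returns the stopped time in seconds parsed from one log line,
// or -1.0 if the line is not an "application stopped" entry.
// Assumed line format (from -XX:+PrintGCApplicationStoppedTime):
//   Total time for which application threads were stopped: 0.0000850 seconds
double stopped_seconds(const std::string& line) {
    static const std::string marker =
        "Total time for which application threads were stopped: ";
    const std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos) {
        return -1.0;
    }
    try {
        // std::stod parses the leading number and ignores the trailing text.
        return std::stod(line.substr(pos + marker.size()));
    } catch (const std::exception&) {
        return -1.0;  // marker present but no parsable number
    }
}

// True if the line reports a pause at or above the threshold.
// The 7.0 default is a hypothetical choice, not a value from the JVM.
bool is_long_pause(const std::string& line, double threshold_seconds = 7.0) {
    return stopped_seconds(line) >= threshold_seconds;
}
```

Feeding each log line to is_long_pause would flag only the multi-second stalls and skip the many short safe point pauses.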

After building the JVM from source and inserting debugging statements in
various places, we were able to determine that the pause was the result
of a synchronization problem in the pseudo memory barrier code that
attempts to control JVM safe point entry across multiple processors.

We verified this by enabling the reinstated -XX:+UseMembar option. This
did appear to clear the problem; however, the overall performance of the
system was not acceptable with this option enabled, since it uses a true
memory barrier instruction to synchronize the multiple processors.

Further investigation into the problem pointed to a race condition and
associated thread starvation during entry into the JVM global safe
point. The pseudo memory barrier code depends on the SIGSEGV error
processing triggered when a thread attempts to access a block of shared
memory that another thread has protected. While one thread was blocked
trying to protect the shared memory to enter the safe point, another
thread looped repeatedly in the SIGSEGV handler code. This continued for
random lengths of time until the protecting thread managed to get a time
slice on the same CPU.
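
The fault-and-retry behaviour described above can be illustrated with a small, single-threaded stand-in. This is only a sketch under stated assumptions, not the HotSpot code: it assumes Linux, where returning from a SIGSEGV handler re-executes the faulting store, and it lets the handler itself lift the protection after a fixed number of faults (kFaultsBeforeRelease, a made-up value), whereas in the scenario above a different thread would have to be scheduled to do that. Calling mprotect from a signal handler is also an assumption that holds on Linux but is not guaranteed by POSIX. All names are hypothetical.

```cpp
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

namespace {

volatile char* g_page = nullptr;         // page standing in for the shared block
size_t g_page_size = 0;
volatile sig_atomic_t g_faults = 0;      // how often the store has faulted
constexpr int kFaultsBeforeRelease = 3;  // hypothetical: when the page is released

// SIGSEGV handler: counts the fault and simply returns, so the faulting
// store is retried. Nothing here blocks or yields, which is the spin the
// report describes.
void on_segv(int, siginfo_t* info, void*) {
    const char* addr = static_cast<const char*>(info->si_addr);
    const char* base = const_cast<const char*>(g_page);
    if (addr < base || addr >= base + g_page_size) {
        _exit(1);  // a genuine crash elsewhere, not our page
    }
    g_faults = g_faults + 1;
    if (g_faults >= kFaultsBeforeRelease) {
        // Stand-in for the protecting thread finally getting to run.
        mprotect(const_cast<char*>(g_page), g_page_size,
                 PROT_READ | PROT_WRITE);
    }
}

}  // namespace

// Protects a page, stores to it, and returns how many times the store
// faulted before the protection was lifted (-1 if setup failed).
int count_faults_until_released() {
    g_page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    void* mem = mmap(nullptr, g_page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }
    g_page = static_cast<char*>(mem);
    g_faults = 0;

    struct sigaction sa = {};
    struct sigaction old = {};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    mprotect(mem, g_page_size, PROT_READ);  // make the page read-only
    g_page[0] = 1;                          // faults and retries until released

    sigaction(SIGSEGV, &old, nullptr);      // restore the previous handler
    munmap(mem, g_page_size);
    return g_faults;
}
```

In this stand-in the loop length is fixed by kFaultsBeforeRelease. In the reported scenario it depends on when the scheduler runs the protecting thread, which would explain why the pause length appears random.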

We believe this appears random because it only occurs on safe point
entry when there are other threads executing and when the thread trying
to force the safe point and the outstanding threads are on the same CPU.
It also appears to happen very frequently, but long pauses seem to occur
only rarely: often the number of iterations through the SIGSEGV loop is
less than 10 and the pause escapes detection.

THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
See description

EXPECTED VERSUS ACTUAL BEHAVIOR :
See description

ERROR MESSAGES/STACK TRACES THAT OCCUR :
Not available

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
Not available
---------- END SOURCE ----------

CUSTOMER SUBMITTED WORKAROUND :
We can make available a patch that we are using successfully under
production loads. This patch tracks the number of times a thread iterates
through the SIGSEGV handler and yields the CPU to the safepoint
serializing thread if the count exceeds 10. This eliminates the longer
pauses while still allowing the loop to "spin" briefly, as it frequently
does in normal operation.
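
The idea behind the workaround can be sketched as a bounded spin followed by a yield. This is a minimal illustration, not the submitted patch: the atomic flag stands in for "the shared memory is accessible again", the function name is hypothetical, and whether the real patch keeps yielding on every later iteration (as here) or resets its counter is an assumption. Only the threshold of 10 comes from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Threshold from the description: yield once the count exceeds 10.
constexpr int kSpinLimit = 10;

// Waits until `ready` becomes true. The first `spin_limit` iterations are
// plain retries; after that the thread yields the CPU on every iteration
// so a waiting thread on the same CPU can be scheduled.
// Returns how many times the caller yielded.
int spin_then_yield(const std::atomic<bool>& ready,
                    int spin_limit = kSpinLimit) {
    assert(spin_limit >= 0);
    int iterations = 0;
    int yields = 0;
    while (!ready.load(std::memory_order_acquire)) {
        ++iterations;
        if (iterations > spin_limit) {
            std::this_thread::yield();  // give the other thread a time slice
            ++yields;
        }
    }
    return yields;
}
```

Short waits finish inside the spin phase and never yield, which matches the observation that most passes through the loop take fewer than 10 iterations.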

We are not sure this is the optimal patch, but it does clearly
demonstrate the issue we were encountering with the pseudo memory
barrier implementation in our system environments.

Fixed misspelling of "pseudo" in Synopsis field.

backported by

JDK-2176972 Synchronization problem in the pseudo memory barrier code

Closed

JDK-2150325 Synchronization problem in the pseudo memory barrier code

Closed

JDK-2150326 Synchronization problem in the pseudo memory barrier code

Closed

duplicates

JDK-6596629 System hangs/freezes for a period during a long running test of Java Application - GC logs show gaps

Closed

JDK-6876168 JVM stalls with high system time observed

Closed

relates to

JDK-6518490 Solaris TS scheduling class anti-starvation facility does not completely avoid starvation

Closed

(1 relates to)

Assignee: Xiaobin Lu (Inactive)

Reporter: Nelson Dcosta (Inactive)

Votes: 0

Watchers: 2

Created: 2007-04-16 01:53

Updated: 2011-03-07 16:35

Resolved: 2011-03-07 16:35

Imported: 17/Sep/12 7:15 AM

Indexed: 19/Jul/12 10:28 PM
