VM crash bug reported by one of the CAP members. Their tests are not
very portable; setting them up and running them would take a lot of work.
The error log (hs_err_pid26796.log) and core files were attached instead.
They are willing to work with someone on the HotSpot team to narrow the
problem down and, if possible, produce a test case outside their product code.
----------------------------------------------------------------------------------
J2SE Version (please include all output from java -version flag):
java version "1.4.0-beta"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b65)
Java HotSpot(TM) Client VM (build 1.4.0-beta-b65, mixed mode)
Does this problem occur on J2SE 1.3? Yes / No (pick one)
No
Operating System Configuration Information (be specific):
Solaris 7 with the latest patches.
Hardware Configuration Information (be specific):
This was run on a quad-processor Sun 420R with 1 GB of RAM.
Bug Description:
When more than one thread is executing a section of code, it appears to be
possible for HotSpot to zero out the instructions that one of the two (or
more) threads is currently executing (or will execute in the near future).
This seems to happen after the code has executed several hundred times, as
though HotSpot comes back to reoptimize the code segment, zeroes out the
old instructions, and then lets another thread kill the JVM because the
SPARC processor cannot execute the instruction 0x0.
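For illustration only, the following is a minimal sketch of the kind of
stand-alone, multi-threaded harness that exercises the pattern described
above (several threads repeatedly executing the same hot method with no
think time). It is not taken from MapXtreme Java and is not known to
reproduce the crash; every class and method name in it is hypothetical.

    // Hypothetical sketch: several threads hammer the same method so that
    // HotSpot compiles (and possibly recompiles) it while other threads are
    // still executing it. The body of render() is arbitrary busy-work.
    public class HotMethodRace {

        static int render(int seed) {
            int acc = seed;
            for (int i = 0; i < 1000; i++) {
                acc = acc * 31 + i;
            }
            return acc;
        }

        public static void main(String[] args) throws InterruptedException {
            int threads = 2;  // the tests below suggest two threads may be enough
            Thread[] workers = new Thread[threads];
            for (int t = 0; t < threads; t++) {
                final int id = t;
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        // No think time: call the hot method continuously,
                        // as in the stress-mode tests described in the report.
                        for (int i = 0; i < 1000000; i++) {
                            render(id + i);
                        }
                    }
                });
                workers[t].start();
            }
            for (int t = 0; t < threads; t++) {
                workers[t].join();
            }
            System.out.println("completed without a crash");
        }
    }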
Detailed problem description from the customer:
+++++++++++++++++++++++++++++++++++++++++
JVM was configured with:
Tests 1 - 7: -server -Xms128m -Xmx512m
Test 8: -server -Xms512m -Xmx512m
Test 9: -server -Xms512m -Xmx512m
Tests:
Test 1:
Running any test against MapXtreme Java causes the JVM to crash.
Every crash is caused specifically by Solaris delivering a SIGILL
to one of the threads running inside the JVM. The JVM traps
the SIGILL (signal 4), prints an error report to the console,
and then calls abort() to create a core file. Note that abort()
terminates the calling process by raising SIGABRT (signal 6).
This "test" was actually run many times with differing user loads.
The number of virtual users ranged from 50 down to 1, always running
in stress mode (no think time). In all tests involving more than 10
users, 10 users were started at test start, and the number of users
was ramped up at a rate of 10 users per minute.
Crashes produced during this test always had the same "appearance" in
the core image file. Many threads were active at once (say 50 threads),
but only one had received the SIGILL that caused the process to shut down.
Additionally, the call stack for that thread contained 24 HotSpot-generated
functions. (HotSpot-generated functions do not have entries in the symbol
tables and as a result appear as "??" in GDB.)
Oddly enough, all memory around the instruction that caused the
SIGILL (and the instruction itself) is zeroed out in the core file.
Is this a "feature" of the core dump facility on Solaris, a bug
in GDB, or really what happened inside the JVM? (i.e., did HotSpot or
the GC zero out memory it wasn't supposed to?) I don't
think this is a GDB bug, as other core files created due to SIGSEGV
seem to have legal SPARC instructions at the reported PC.
Test 2:
What effect does single-threading the server have, by reconfiguring
Silk to issue only one request at a time per user? If the
crash is thread-related, we should not see a crash in the
single-threaded case.
One virtual user was configured to issue only one request at a time,
but with no think time between requests.
Result: 3,559 requests in 28 minutes. No errors.
Test 3:
If test #2 holds true and single-threaded operation works, then running
only two threads may be enough to reproduce the crash.
One virtual user was configured to issue two concurrent requests
at a time, with no think time between requests.
Result: crash around 5,000 requests, after 10 minutes of load. The crash
appears to be the same as observed in test #1 (a SIGILL delivered
with 24 stack frames of HotSpot-generated functions).
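As a point of reference, the sketch below shows what a minimal
stand-alone driver equivalent to the Silk configuration in tests 2 and 3
might look like: a fixed number of worker threads issuing requests
back-to-back with no think time. The servlet URL and class name are
placeholders, not the actual test setup.

    // Hypothetical sketch of a no-think-time request driver. The target
    // URL below is a placeholder; it does not point at the real servlet.
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ConcurrentRequestDriver {
        public static void main(String[] args) throws Exception {
            final URL target = new URL("http://localhost:8080/SimpleMapTestPlus");
            int concurrent = 2;  // test 3 used two concurrent requests per user
            for (int t = 0; t < concurrent; t++) {
                new Thread(new Runnable() {
                    public void run() {
                        byte[] buf = new byte[4096];
                        // Runs until the JVM crashes or the process is killed,
                        // matching the stress-mode tests described above.
                        while (true) {
                            try {
                                HttpURLConnection conn =
                                    (HttpURLConnection) target.openConnection();
                                InputStream in = conn.getInputStream();
                                while (in.read(buf) != -1) {
                                    // drain the response; no think time follows
                                }
                                in.close();
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                        }
                    }
                }).start();
            }
        }
    }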
Test 4:
According to the help for the "java" command, -Xbatch should prevent
HotSpot from replacing the code of a method at runtime. Since the
crash occurs only after several hundred requests have been processed
successfully, it must be caused either by replacement code generated by
HotSpot in the middle of the test or by a periodic background cleanup
task triggering. Since the single-user case held, I'm leaning
towards the former. I also suspect that HotSpot is
changing the code for a method while another thread is attempting to
execute it, and that the crash happens because the executing thread sees
the machine code in the middle of the change (an unstable state).
Result: Using -Xbatch just causes the JVM to suffer many
NoClassDefFoundError exceptions in com.ibm.xml.parser.Token.getName.
A very odd error; I can only conclude that -Xbatch does not work in
this version of the JVM. This test was run many times with the
same result. The good news is that the JVM does not SIGILL when using
-Xbatch, but I don't think it gets far enough to reach the point where
the SIGILL would occur.
Additionally, the stack trace created from the NoClassDefFoundError
is 34 methods deep. Unless HotSpot was able to inline 10 methods to
reduce the stack depth, this NoClassDefFoundError cannot be the problem
we are seeing in the other 3 tests. On top of this, absolutely no
request completes successfully with -Xbatch enabled, so this is really
of no help.
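One way to probe the mid-test-recompilation hypothesis without relying on
-Xbatch is sketched below: warm the suspect code up on a single thread
first, so any HotSpot compilation happens before concurrency starts, and
only then apply the concurrent load. This is only an outline of the
experiment; the class and method names are invented and do not come from
MapXtreme Java.

    // Hypothetical sketch: single-threaded warm-up followed by the same
    // concurrent, no-think-time load. If the crash disappears, that would
    // point at compilation racing with concurrent execution of the method.
    public class WarmupThenLoad {

        static int hotMethod(int seed) {
            int acc = seed;
            for (int i = 0; i < 1000; i++) {
                acc = acc * 31 + i;
            }
            return acc;
        }

        public static void main(String[] args) throws InterruptedException {
            // Warm up on one thread so HotSpot should have compiled
            // hotMethod before any second thread touches it.
            for (int i = 0; i < 100000; i++) {
                hotMethod(i);
            }
            Thread[] workers = new Thread[2];
            for (int t = 0; t < workers.length; t++) {
                final int id = t;
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < 1000000; i++) {
                            hotMethod(id + i);
                        }
                    }
                });
                workers[t].start();
            }
            for (int t = 0; t < workers.length; t++) {
                workers[t].join();
            }
        }
    }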
Test 5:
I split the client servlet (SimpleMapTestPlus) into its own JVM,
separate from the MapXtreme Java server servlet. Both ran on the same
machine underneath the same Apache server.
The result was the same as in all other tests (except test #4). The
server JVM process died with a SIGILL, and its stack trace (as reported
by GDB) shows 24 HotSpot functions on the call stack.
Test 6:
Paul Jossman suggested using MapXtreme Java 4.0 build 20, as its XML
parser has been switched from IBM's XML parser to Apache's Xerces
parser. This test was run with a single virtual user allowed to
make 4 concurrent connections to MapXtreme Java. Both client and
server servlets were in the same JVM, and no think time was allowed
between requests.
The JVM made it through about 4,000 requests before it died. Its
death yielded the same SIGILL and 24-frame stack trace as every
other crash.
Test 7:
MapXtreme Java 4.0 build 20 was tested with -Xbatch enabled to see
if this cleared up the NoClassDefFoundError seen in Test 4.
After 186 successful requests, the JVM died with a SIGSEGV (signal 11).
At least with the Xerces XML parser we do not see the NoClassDefFoundError.
Examining the core file in gdb reveals that the thread which caught
the signal was executing in a JVM-internal method:
int PhaseChaitin::stretch_base_pointer_live_ranges(ResourceArea*)
Some of the classes calling this method appeared to belong to the runtime
HotSpot compiler. Perhaps the reason for the crash is an invalid
pointer dereference in the HotSpot compiler itself.
Test 8:
Under the assumption that test 7's error was a result of trying to
grow the heap at runtime, I set the initial heap size equal to the
maximum. However, since the call stack contained references to the
"Compiler" object, I doubt this is the cause.
The test ran successfully for an hour, completing 18,278 image requests
with one virtual user, no think time, and 4 concurrent connections.
12 hours after the test completed (around 4:20 am), when there was no
load on the server, the JVM randomly SIGSEGVd.
Conclusions:
It would seem as though this particular version of the JVM has some
errors in its HotSpot code generator (the runtime compiler). On Solaris,
after a period of time we see the 32-bit JVM crash with a SIGILL
having been delivered to the process. The SIGILL is always issued at
the same point in our software: 24 stack frames down on the runtime
stack. Each of these entries is a dynamically created method, with only
the HotSpot error-trapping code on one end of the call stack and the
JVM thread root functions on the other. There appear to be 9 functions
associated with the Tomcat call stack, leaving the last 15 functions
to be (possibly) ones from MapXtreme Java's server servlet.
It would seem as though the crash occurs after handling about 533
requests. Typically the easiest way to cause the crash is to run
MapXtreme Java with a 1-user stress test loading all 13 images in
the MapXtreme Java 3.1 test Shawn B. created. In this test, Silk
is running 4 concurrent connections per user to the server.
Perhaps the issue with multiple threads is that HotSpot has recreated
the instructions for the method, but has somehow erred in copying
the new instructions into the method's storage in memory. As a result,
some other thread calls into the new method before it is
truly ready for execution and tries to execute a partially complete
instruction, or something which was not an instruction at all (but rather
older data lying in memory). Is the invalid instruction a null word or
something silly like that?
It would seem as though load has no bearing on when this crash will
occur, as even one user (with no think time) can bring the JVM down with
this error. I now have 3 core dumps showing identical stack traces from
a thread dying with this SIGILL. Unfortunately, since HotSpot
generates the code on the fly, there is no symbol table associated
with the stack frames to reveal which MapXtreme Java method is
causing the error. If we can identify the current method of the thread
that received the SIGILL, perhaps we can give Sun a test case
which can reproduce the error.
With newer releases of the 1.4 JVM, we have to wonder whether this
particular HotSpot bug has been fixed, or whether the bug is still
present but other changes to HotSpot's runtime compiler shift the number
of stack frames seen and their alignment in memory such that it no
longer appears to be the same error.
All fingers point to MapXtreme Java as the software causing HotSpot
to generate illegal machine code, as the NullServlet test with Tomcat
did not suffer from this runtime problem. Since we can run several
hundred successful requests through the server before the crash, I am
led to believe that either HotSpot recompiles the victim method
improperly as a performance improvement, or that the victim method
really is called only infrequently by MapXtreme Java (and, as it
happens, is invoked only once every few hundred requests, for example
as a background cleanup process).
Update:
After examining most of the core files from the tests, it would appear
as though memory has been zeroed around the instruction which is
raising the illegal instruction. I examined the stacks of every
active thread visible in the core files, and it would appear as though
the memory was zeroed while the victim thread was sleeping. When it
woke up, it died with a SIGILL. It is not known when the memory zeroing
occurred - it may have occurred while the thread was waiting for I/O, a
system call, or an AWT call to complete, after which it called into a
HotSpot-compiled method which had been zeroed over. Or it was preempted,
and while waiting for control one or more methods were zeroed behind its
back. This really looks like a race condition, and the method being
zeroed also appears to be a MapXtreme Java server-side method.
Duplicates:
JDK-4473094 64-Bit Server VM warning: Attempt to protect stack guard pages failed (Closed)
Relates to:
JDK-4479689 VM does not defend against OutOfMemory errors on MDO creation (Resolved)