The customer is seeing this in their main application, running on 1.3.1_02 with BEA's WebLogic, but they have provided a nice test case, which is attached, along with a set of cores and their matching hs_err files from my reproduction system using Tomcat and various J2SE versions. One of the hs_err files looks like this:
-------
Unexpected Signal : 11 occurred at PC=0xfb041c38
Function name=<init> (compiled Java code)
Library=(N/A)
Current Java thread:
Dynamic libraries:
0x10000 /usr/j2sdk1_3_1_02//bin/../bin/sparc/native_threads/java
0xff350000 /usr/lib/libthread.so.1
0xff390000 /usr/lib/libdl.so.1
0xff200000 /usr/lib/libc.so.1
0xff330000 /usr/platform/SUNW,Ultra-80/lib/libc_psr.so.1
0xfe480000 /usr/j2sdk1_3_1_02/jre/lib/sparc/client/libjvm.so
0xff2e0000 /usr/lib/libCrun.so.1
0xff1e0000 /usr/lib/libsocket.so.1
0xff100000 /usr/lib/libnsl.so.1
0xff0d0000 /usr/lib/libm.so.1
0xff310000 /usr/lib/libw.so.1
0xff0b0000 /usr/lib/libmp.so.2
0xff080000 /usr/j2sdk1_3_1_02/jre/lib/sparc/native_threads/libhpi.so
0xff050000 /usr/j2sdk1_3_1_02/jre/lib/sparc/libverify.so
0xfe440000 /usr/j2sdk1_3_1_02/jre/lib/sparc/libjava.so
0xff020000 /usr/j2sdk1_3_1_02/jre/lib/sparc/libzip.so
0xfd3d0000 /usr/j2sdk1_3_1_02/jre/lib/sparc/libnet.so
0xfafe0000 /usr/lib/nss_nisplus.so.1
0xfafc0000 /usr/lib/libdoor.so.1
Local Time = Wed Jan 30 17:03:33 2002
Elapsed Time = 3627
#
# HotSpot Virtual Machine Error : 11
# Error ID : 4F530E43505002C4 01
# Please report this error at
# http://java.sun.com/cgi-bin/bugreport.cgi
#
# Java VM: Java HotSpot(TM) Client VM (1.3.1_02-b02 mixed mode)
#
------
So far I've been unable to make this crash on a uniprocessor. On a quad-processor U80, though, I start to see the issue when running their test case on Tomcat. I'm stressing the servlet with our internal load generator as follows:
injectload -threads30 -requests100000 -think0 -host<tomcat server> -port8080 -file/servlet/TestLoadCompress\?chars=2000
I run this 2 or 3 times at a go to create a load of some 60-90 concurrent users. The chars=n argument to the servlet is the control here (a sketch of the servlet follows the quote below). To quote the customer:
In my test case (calling a simple servlet which delivers static html) I noticed the OOM is influenced by the size of the html. If it is below 2049 (say 2048) bytes the server runs for a much longer time than it does if the size is above 2048 (say 2049). I've looked into SocketOutputStream.c (hotspot jvm code) and noticed that 2048 (MAX_BUFFER_LEN) is the limit above which a malloc is done instead of using stack allocated memory. As I understand the code, a free is always done and hence there should not be a problem. It is curious though that exactly that value changes the outcome of my test.
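For reference, the servlet under test simply returns a response of chars bytes. The real source is in the attached test case; what follows is only a minimal sketch of what a servlet like TestLoadCompress presumably does (the class name comes from the URL in the injectload command above, everything else here is assumed), showing why chars=2000 stays under the 2048-byte MAX_BUFFER_LEN the customer mentions while chars=2049 goes over it:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical reconstruction of the servlet under test: it just writes
// 'chars' bytes of static HTML.  With chars=2000 every response body stays
// below the 2048-byte limit, so the native socket write can use a stack
// buffer; with chars=2049 it goes over and the native code mallocs a
// temporary buffer instead.
public class TestLoadCompress extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws IOException {
        int n = Integer.parseInt(req.getParameter("chars"));
        StringBuffer body = new StringBuffer(n);
        for (int i = 0; i < n; i++) {
            body.append('x');                   // filler to reach the requested size
        }
        res.setContentType("text/html");
        PrintWriter out = res.getWriter();
        out.print(body.toString());
        out.flush();                            // push the bytes down to the socket
    }
}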
Using 1.3.1_02 with either -server or -client, it will start showing this error after a short while:
2002-01-30 04:39:33 - ContextManager: Error reading request, ignored - java.lang.NullPointerException
at org.apache.tomcat.service.http.HttpConnectionHandler.processConnection(HttpConnectionHandler.java:191)
at org.apache.tomcat.service.TcpWorkerThread.runIt(PoolTcpEndpoint.java:416)
at org.apache.tomcat.util.ThreadPool$ControlRunnable.run(ThreadPool.java:498)
at java.lang.Thread.run(Thread.java:484)
Or:
2002-01-31 01:40:11 - ContextManager: Error reading request, ignored - java.lang.NullPointerException
<<no stack trace available>>
with these getting more and more frequent until it finally crashes. For me it always ends in a SEGV; I haven't seen the OOMs the customer sees with WebLogic, but since the SEGVs appear to be in the GC subsystem, that may not be a significant difference.
With 1.4.0 RC I don't get crashes, but after a much longer run (10+ hours) I get a hang. I've taken a gcore of this, which is also in the attached file, and trussing the process showed an awful lot of sleeping LWPs:
:
lwp_cond_wait(0x000E1418, 0x000E1400, 0x00000000) (sleeping...)
lwp_cond_wait(0x000E28B8, 0x000E28A0, 0x00000000) (sleeping...)
lwp_mutex_lock(0x0002A488) (sleeping...)
lwp_cond_wait(0x0002A400, 0x0002A3E8, 0x00000000) (sleeping...)
lwp_sema_wait(0xF1B81E30) (sleeping...)
lwp_cond_wait(0x0002AEE8, 0x0002AED0, 0x00000000) (sleeping...)
lwp_cond_wait(0x0002AF50, 0x0002AF38, 0x00000000) (sleeping...)
lwp_cond_wait(0x002155B8, 0x002155A0, 0x00000000) (sleeping...)
lwp_mutex_lock(0x0002A488) (sleeping...)
lwp_mutex_lock(0x0002A488) (sleeping...)
lwp_mutex_lock(0x0002A488) (sleeping...)
lwp_mutex_lock(0x0002A488) (sleeping...)
lwp_cond_wait(0x00259918, 0x00259900, 0x00000000) (sleeping...)
lwp_cond_wait(0x004E5A58, 0x004E5A40, 0x00000000) (sleeping...)
lwp_cond_wait(0x00258C00, 0x00258BE8, 0x00000000) (sleeping...)
:
for example, but little else. So it looks like memory is filling up until the JVM gets stuck trying to free it?
The customer feels this is all due to the finalize() method in PlainSocketImpl, as all the previous problems he has had have been caused by the use of finalisation, with the finaliser thread not keeping up on multiprocessor systems.
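To make that theory concrete (this is only an illustrative sketch of the mechanism, not the customer's code): an object whose class overrides finalize(), as PlainSocketImpl does, cannot have its memory reused as soon as it becomes garbage; it is first queued for the single finaliser thread and is only reclaimed by a later collection. If worker threads create such objects faster than that one thread can drain the queue, which is much easier on a multiprocessor, dead-but-unfinalised objects accumulate until the heap is exhausted:

// Hypothetical demonstration of a finaliser backlog (not the customer's
// code).  Each Holder pretends to own a native resource the way a socket
// implementation does; its finalize() is deliberately slow, so allocation
// outruns the single finaliser thread and unreclaimed instances pile up
// until the heap fills and an OutOfMemoryError is thrown.
public class FinalizerBacklog {
    static class Holder {
        private final byte[] payload = new byte[2048];   // stand-in for per-socket buffers
        protected void finalize() throws Throwable {
            try {
                Thread.sleep(1);        // simulate cleanup work (e.g. closing a socket)
            } finally {
                super.finalize();
            }
        }
    }

    public static void main(String[] args) {
        while (true) {
            new Holder();               // garbage immediately, but its memory cannot be
                                        // reused until the finaliser thread gets to it
        }
    }
}

On a uniprocessor the allocating loop and the finaliser thread share one CPU, so the backlog grows far more slowly, which would be consistent with the crash only reproducing on the quad-CPU U80.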
His experience using WebLogic instead of Tomcat is:
My tests have failed in a few different ways, from just OOM and no core, to SEGV, to OOM and core. The way I interpret this is that as soon as you run low on memory a few different things can happen.
If a successful GC frees up enough memory, the JVM can work for a little while after that, but it will eventually die.
If some GC kicks in while the JVM is doing optimized compilation, chances are you get a SEGV (handling of signals, I think they are SIGUSR, inside the JVM is not the best possible). I've got many of them after seeing OOMs, especially if the OOM was thrown after a call to SocketOutputStream.
The more often GC kicks in (which is of course the case if you have OOMs) the more chance you have of the JVM going nuts with signals (this is a known problem for the "-server" option, as it was for "-X:optimize" under 1.2.2_007).
In my opinion it is just a symptom (that should anyway be cured). The real cause is that the JVM has to do too much GC because of its inefficiency.
Relates to:
- JDK-4546307 libjvm 1.3.1 Scavenge func SIGSEGV causes Weblogics Apps Server 6 SP2 to Abort (Closed)
- JDK-4525290 Thread Finalizers problem & OutOfMemoryError under jdk1.3 (Closed)