Type: Bug
Resolution: Fixed
Priority: P3
Affects Version: 1.4.2
Fix Version: 1.5
CPU: sparc
OS: solaris_8
Verification: Not verified
Customer is running their server under heavy load: 1000 concurrent HTTP
clients. The server is configured with 1700 client threads and runs on a
4-way 700MHz Xeon box with 2GB RAM under Windows 2000 Advanced Server;
they have also tested on Solaris 8 SPARC and see the same problem.
We have recreated the customer environment in our lab on a 4-way SunBlade
running Solaris 8. We installed the Tomcat JSP examples into their web
container and used a load generator to create 2000 threads that repeatedly
request date.jsp. We ran their web container with the 1.4.1 and 1.4.2
runtimes, and the Server compiler appears to hit some type of deadlock
under load.
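For reference, a simple load generator along these lines can stand in for
the one we used (the URL, port, and thread count here are illustrative; the
real run pointed 2000 threads at date.jsp in the customer's web container):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class PingDateJsp {
        public static void main(String[] args) {
            // URL and thread count are illustrative, not the customer's values.
            final String target = args.length > 0
                    ? args[0] : "http://localhost:8080/examples/jsp/dates/date.jsp";
            int threads = args.length > 1 ? Integer.parseInt(args[1]) : 2000;
            for (int i = 0; i < threads; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        byte[] buf = new byte[4096];
                        while (true) {
                            try {
                                HttpURLConnection conn = (HttpURLConnection)
                                        new URL(target).openConnection();
                                InputStream in = conn.getInputStream();
                                while (in.read(buf) != -1) {
                                    // drain the response body
                                }
                                in.close();
                            } catch (Exception e) {
                                // ignore transient errors and keep requesting
                            }
                        }
                    }
                }).start();
            }
        }
    }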
Many of their worker threads on the server are stuck in a
java/io/PrintWriter.flush call.
When we run the same test with the Client compiler in 1.4.1, the problem
goes away. We have also worked around the problem by excluding the
PrintWriter.flush method in a .hotspot_compiler file.
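For reference, the exclusion is a single directive in a .hotspot_compiler
file placed in the directory the VM is started from (class name in the
internal slash form, then the method name):

    exclude java/io/PrintWriter flush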
The pstack outputs are in the attachments. The 1.4.2 pstack shows most of
the symbols because we ran with a debug build, which lets a custom pstack
translate the compiled Java methods.
Here's an example of a thread waiting in flush():
"client1500" daemon prio=5 tid=0x283D0EE0 nid=0x12a8 runnable
[400ef000..400efdc0]
at java.io.PrintWriter.flush(PrintWriter.java:121)
at com.sssw.srv.jsp.AgoJspWriter.flush(AgoJspWriter.java:198)
at com.sssw.srv.jsp.AgoJspPageContext.release(AgoJspPageContext.java:87)
at
com.sssw.srv.jsp.AgoJspFactory.releasePageContext(AgoJspFactory.java:59)
at
com.sssw.gen.jsp.jsp_include_include_jsp_1016613202._jspService(jsp_include_include_jsp_1016613202.java:104)
at com.sssw.srv.jsp.AgoHttpJspPage.service(AgoHttpJspPage.java:94)
at com.sssw.srv.busobj.AgoFilterChain.doServlet(AgoFilterChain.java:394)
at com.sssw.srv.busobj.AgoFilterChain.doFilter(AgoFilterChain.java:180)
at com.sssw.srv.resources.AgWarResource.service(AgWarResource.java:629)
at
com.sssw.srv.resources.AgWarURLResource.perform(AgWarURLResource.java:113)
at com.sssw.srv.http.httpd.perform(httpd.java:5568)
at com.sssw.srv.http.Client.processRequest(Client.java:905)
at com.sssw.srv.http.Client.loop(Client.java:1271)
at com.sssw.srv.http.Client.runConnection(Client.java:1487)
at com.sssw.srv.http.Client.run(Client.java:1435)
at java.lang.Thread.run(Thread.java:536)
If we let the test continue to run, we end up with hundreds of threads
(831 in my last test) sitting in this state. About half of them are in the
state above, and about half are calling flush() from a different path:
"client1499" daemon prio=5 tid=0x283D0288 nid=0xf34 runnable
[400af000..400afdc0]
at java.io.PrintWriter.flush(PrintWriter.java:121)
at
com.sssw.srv.busobj.AgoHttpReplyEvent.completeContent(AgoHttpReplyEvent.java:87)
at com.sssw.srv.resources.AgWarResource.service(AgWarResource.java:631)
at
com.sssw.srv.resources.AgWarURLResource.perform(AgWarURLResource.java:113)
at com.sssw.srv.http.httpd.perform(httpd.java:5568)
at com.sssw.srv.http.Client.processRequest(Client.java:905)
at com.sssw.srv.http.Client.loop(Client.java:1271)
at com.sssw.srv.http.Client.runConnection(Client.java:1487)
at com.sssw.srv.http.Client.run(Client.java:1435)
at java.lang.Thread.run(Thread.java:536)
Again, none of these PrintWriter objects share the same 'lock'. Looking
through all the threads, I don't see any thread that holds the lock.
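For context, PrintWriter.flush does its work inside synchronized (lock),
where 'lock' is the protected per-instance field inherited from
java.io.Writer, so writers wrapping different underlying streams should
never contend with each other. A quick check (our own throwaway snippet,
using reflection to peek at that protected field) confirms that two
independent PrintWriters do not share a lock object:

    import java.io.PrintWriter;
    import java.io.StringWriter;
    import java.io.Writer;
    import java.lang.reflect.Field;

    public class LockCheck {
        public static void main(String[] args) throws Exception {
            PrintWriter a = new PrintWriter(new StringWriter());
            PrintWriter b = new PrintWriter(new StringWriter());
            Field f = Writer.class.getDeclaredField("lock");
            f.setAccessible(true);
            // Expect 'false': flush() synchronizes on this per-instance
            // object, so independent writers should never block each other.
            System.out.println("a and b share a lock? " + (f.get(a) == f.get(b)));
        }
    }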
Thousands and thousands of requests complete successfully, so not all
threads are getting hung up here, but more and more do as the test runs on.
If we take successive Ctrl-Break thread dumps, some of the threads that
were waiting in flush() in earlier traces are no longer waiting in the
current ones. If I had to guess what is happening, it looks like there is
some scheduling bottleneck coming out of the synchronized block on an SMP
system.
This is really hurting our performance and scalability. We realize that
having this many threads is not a good design (we are fixing that in the
release we are working on), but it is surprising that this particular
synchronized block causes problems while thousands of others don't.
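For discussion, here is a minimal standalone sketch (class name, thread
count, and payload are ours, purely illustrative) of the locking pattern
described above: each thread owns its own PrintWriter and calls flush() in
a loop, so in principle no two threads should ever contend on the same
lock. We have not verified that this standalone version reproduces the
hang; it is only meant to illustrate the per-writer pattern the server uses.

    import java.io.PrintWriter;
    import java.io.StringWriter;

    public class FlushStress {
        public static void main(String[] args) throws Exception {
            int threads = args.length > 0 ? Integer.parseInt(args[0]) : 2000;
            for (int i = 0; i < threads; i++) {
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        // Each worker owns its own PrintWriter, so flush()
                        // synchronizes on a per-instance lock and should not
                        // block against any other worker.
                        StringWriter sink = new StringWriter();
                        PrintWriter out = new PrintWriter(sink);
                        while (true) {
                            out.print("response body");
                            out.flush();
                            sink.getBuffer().setLength(0); // keep the buffer small
                        }
                    }
                }, "client" + i);
                t.setDaemon(true);
                t.start();
            }
            Thread.sleep(Long.MAX_VALUE); // keep main alive; Ctrl-Break for dumps
        }
    }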
----------------------------------------------------------------------------