The customer is running their server under heavy load with 1000 concurrent
HTTP clients. The server is configured with 1700 client threads and is running
on a 4-way 700 MHz Pentium III Xeon box with 2 GB RAM under Windows 2000
Advanced Server. They have also tested on Solaris 8 SPARC and see the same
problem described below.
---------
After a few minutes of running, we start to see (via <ctrl> Break) threads
that are "runnable" but waiting in java.io.PrintWriter.flush(). There is a
synchronized block in PrintWriter.flush(), but each thread has its own
PrintWriter instance, so there should be no contention for 'lock'. Also,
the stack traces show that we are waiting at line 121 (in 1.4.1_02), but
line 121 is the closing brace of the synchronized block.
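For reference, PrintWriter.flush() in the 1.4.x sources is approximately the
following (paraphrased from the JDK source, so the exact line numbers may
differ); the monitor being entered is the 'lock' field inherited from
java.io.Writer:

    public void flush() {
        try {
            synchronized (lock) {   // the block the threads appear stuck in
                ensureOpen();       // throws IOException if the stream is closed
                out.flush();        // delegate to the wrapped Writer
            }
        } catch (IOException x) {
            trouble = true;         // flush() does not propagate IOException
        }
    }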
Here's an example of a thread waiting in flush():
"client1500" daemon prio=5 tid=0x283D0EE0 nid=0x12a8 runnable
[400ef000..400efdc0]
at java.io.PrintWriter.flush(PrintWriter.java:121)
at com.sssw.srv.jsp.AgoJspWriter.flush(AgoJspWriter.java:198)
at com.sssw.srv.jsp.AgoJspPageContext.release(AgoJspPageContext.java:87)
at com.sssw.srv.jsp.AgoJspFactory.releasePageContext(AgoJspFactory.java:59)
at com.sssw.gen.jsp.jsp_include_include_jsp_1016613202._jspService(jsp_include_include_jsp_1016613202.java:104)
at com.sssw.srv.jsp.AgoHttpJspPage.service(AgoHttpJspPage.java:94)
at com.sssw.srv.busobj.AgoFilterChain.doServlet(AgoFilterChain.java:394)
at com.sssw.srv.busobj.AgoFilterChain.doFilter(AgoFilterChain.java:180)
at com.sssw.srv.resources.AgWarResource.service(AgWarResource.java:629)
at com.sssw.srv.resources.AgWarURLResource.perform(AgWarURLResource.java:113)
at com.sssw.srv.http.httpd.perform(httpd.java:5568)
at com.sssw.srv.http.Client.processRequest(Client.java:905)
at com.sssw.srv.http.Client.loop(Client.java:1271)
at com.sssw.srv.http.Client.runConnection(Client.java:1487)
at com.sssw.srv.http.Client.run(Client.java:1435)
at java.lang.Thread.run(Thread.java:536)
If we let the test continue to run, we end up with hundreds of threads (831 in
my last test) stuck in this state. About half of them are sitting in the stack
shown above, and about half are reaching flush() from a different path:
"client1499" daemon prio=5 tid=0x283D0288 nid=0xf34 runnable
[400af000..400afdc0]
at java.io.PrintWriter.flush(PrintWriter.java:121)
at com.sssw.srv.busobj.AgoHttpReplyEvent.completeContent(AgoHttpReplyEvent.java:87)
at com.sssw.srv.resources.AgWarResource.service(AgWarResource.java:631)
at com.sssw.srv.resources.AgWarURLResource.perform(AgWarURLResource.java:113)
at com.sssw.srv.http.httpd.perform(httpd.java:5568)
at com.sssw.srv.http.Client.processRequest(Client.java:905)
at com.sssw.srv.http.Client.loop(Client.java:1271)
at com.sssw.srv.http.Client.runConnection(Client.java:1487)
at com.sssw.srv.http.Client.run(Client.java:1435)
at java.lang.Thread.run(Thread.java:536)
Again, none of these PrintWriter objects share the same 'lock'. Looking
through all the threads, I don't see any thread that holds the lock.
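This matches how PrintWriter is constructed: it passes the wrapped Writer to
the java.io.Writer constructor, so 'lock' is the underlying Writer, and
independently constructed PrintWriters wrapping separate streams synchronize
on separate monitors. A minimal illustration (the StringWriter is just a
stand-in for whatever stream the server actually wraps):

    import java.io.PrintWriter;
    import java.io.StringWriter;

    public class DistinctLocks {
        public static void main(String[] args) {
            // Each PrintWriter wraps its own underlying Writer, so each one
            // inherits a distinct 'lock' object from java.io.Writer.
            PrintWriter w1 = new PrintWriter(new StringWriter());
            PrintWriter w2 = new PrintWriter(new StringWriter());
            w1.flush();   // synchronizes on w1's underlying StringWriter
            w2.flush();   // synchronizes on w2's underlying StringWriter
            // There is no monitor these two flush() calls could contend on.
        }
    }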
Thousands and thousands of requests complete successfully, so not all
threads are getting hung up here, but more and more do as the test runs on.
If we take successive <ctrl> Break dumps, some of the threads that were
waiting in flush() in previous traces are no longer waiting in the current
ones. If I had to guess, it looks like there is some scheduling bottleneck
when threads come out of the synchronized block on an SMP system.
This is really hurting our performance / scalability. We realize that
having this many threads is not a good design (we are fixing this in the
release we are working on), but it is surprising that this particular
synchronized block is causing problems while thousands of others aren't.
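For context, a stripped-down sketch of the pattern under load (thread count,
names, and streams here are illustrative, not the customer's code) looks like
this: many daemon client threads, each flushing its own private PrintWriter
once per simulated request.

    import java.io.PrintWriter;
    import java.io.StringWriter;

    public class FlushLoadSketch {
        public static void main(String[] args) throws InterruptedException {
            // Roughly mirrors the reported configuration: many client threads,
            // each owning and flushing a private PrintWriter per request.
            for (int i = 0; i < 1700; i++) {
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        while (true) {
                            // one writer per simulated request, as in the report
                            PrintWriter out = new PrintWriter(new StringWriter());
                            out.print("response body");
                            out.flush();   // the synchronized block in question
                        }
                    }
                }, "client" + i);
                t.setDaemon(true);
                t.start();
            }
            Thread.sleep(60000);   // keep the daemon threads running for a while
        }
    }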
###@###.### 2003-04-18
We brought the customer's server into the lab and ran a test. This time the
thread dumps listed many of the same hanging threads in a condition-wait state
instead of runnable. In the pstack output, most of these threads show the
following:
----------------- lwp# 22 / thread# 21 --------------------
ff31f008 lwp_cond_wait (3f4160, 3f4148, 0)
ff31654c _lwp_cond_wait (3f4160, 3f4148, 0, ff186000, 0, 0) + c
fee698f8 __1cNObjectMonitorEwait6MxipnGThread__v_ (3f6078, 0, 5c00, 5d04, 87fe08, 55ec) + 61c
fee68c50 __1cSObjectSynchronizerEwait6FnGHandle_xpnGThread__v_ (e43d167c, 0, 0, 87fe08, e43d12b0, 4a4ae8) + 180
fee68e18 JVM_MonitorWait (87fe94, e43d1774, 0, 0, 87fe08, 0) + 130
fa00b92c ???????? (e43d1774, 0, 0, ffffffff, 0, 0)
fa005804 ???????? (e43d17f4, b6, 0, fa015e64, c, e43d1710)
fa005804 ???????? (e43d1884, b6, 0, fa015e30, 4, e43d1790)
fa005804 ???????? (e43d1904, 0, 0, fa015e30, 4, e43d1820)
fa00010c ???????? (e43d1994, e43d1c08, a, f66c6430, 4, e43d18a8)
fed681f8 __1cJJavaCallsLcall_helper6FpnJJavaValue_pnMmethodHandle_pnRJavaCallArguments_pnGThread__v_ (e43d1c00, e43d1a60, e43d1b20, 87fe08, 87fe08, e43d1a74) + 25c
fee65f08 __1cJJavaCallsMcall_virtual6FpnJJavaValue_nLKlassHandle_nMsymbolHandle_4pnRJavaCallArguments_pnGThread__v_ (ff186000, 2f9c50, e43d1b14, e43d1b10, e43d1b20, 87fe08) + 190
fee78f1c __1cJJavaCallsMcall_virtual6FpnJJavaValue_nGHandle_nLKlassHandle_nMsymbolHandle_5pnGThread__v_ (e43d1c00, e43d1bfc, e43d1bf0, e43d1be8, e43d1be0, 87fe08) + 60
fee8b210 __1cMthread_entry6FpnKJavaThread_pnGThread__v_ (87fe08, 87fe08, 3f4330, 2f9c50, 301d38, fee841b0) + 1d8
fee841d8 __1cKJavaThreadDrun6M_v_ (87fe08, 0, 0, 0, 0, 0) + 254
fee80624 _start (87fe08, fe9ef688, 1, 1, fe9ee000, 0) + 20
fe9db728 _thread_start (87fe08, 0, 0, 0, 0, 0) + 40
The entire pstack is located in the attachments.