ocated @ javaserv2:/export/root/JavaDesktop/kona.1.0.kI_1.Z
Ctrace file:../traces/java-6.ctrace_1.0.kI.Aug12.10:15:31
ctrace
PC: ffd4077c
Last leaf: jmpl f81f5060 from f801e4e0
0 w %o0-%o5: ( f81f5060 ffeff000 0 71e1000c 80000000 0 )
call f801e4d0 from f8026294
1 w %o0-%o5: ( 0 f8602000 f8045440 f801c6a8 73 0 )
call f802610c from f801f7c8
2 w %o0-%o5: ( f8045444 f8045440 d f801c6a8 73 0 )
call f801de1c from f801da38
3 w %o0-%o5: ( 0 e02e9d58 18 7efefeff 81010100 ff0000 )
call f804541c from f800c460
4 w %o0-%o5: ( e03ac930 f81ef9a8 e03ac908 f801c6a8 690066 0 )
jmpl f800c444 from f8017ee4
5 w %o0-%o5: ( e03ac920 0 19 7efefeff 81010100 ff00 )
call f8017ecc from f800c4b4
6 w %o0-%o5: ( f800c444 0 c 7efefeff 81010100 ff0000 )
More
[<space>,<cr>,q] ?
call f800c470 from f801c278
7 w %o0-%o5: ( 21 1 f8602000 f8601800 81010100 ff0000 )
call f801c210 from f80262a0
8 w %o0-%o5: ( 1 f8602000 f8045cac 0 0 0 )
call f802610c from f801f7c8
9 w %o0-%o5: ( 9 f8045cac e02f80a0 f863c800 0 e02ea128 )
call f80457fc from f8045ca0
a w %o0-%o5: ( 0 f80ba454 f81f1730 f81f16e0 0 0 )
call f8045c6c from f8019728
b w %o0-%o5: ( 2 e0002bd8 f81f1730 f81f16e0 e0002bd8 e0030d10 )
jmpl f80195e4 from f8045b1c
c w %o0-%o5: ( f81f1400 f8602400 0 1388 e02f1564 f816ca10 )
???? from f8045ad8
d w %o0-%o5: ( 1 e0002bd8 f80195e4 0 0 0 )
ok
Trace file: ../traces/uniq/java-6.trace_1.0.kI.Aug12.10:15:31
jmpl %o0, %o7
call kona_enter_mon
call trap
call sysWrite
call sysMonitorDumpInfo
jmpl %i0, %o7
call registeredEnumerate_unlocked
call DumpMonitors_unlocked
call abort
call trap
call _sched_lock
call sysThreadExit
jmpl %i2, %o7
save %sp, -0x70, %sp
Snmp Dump: Not available
The description field as copied from bug report 1262836 follows:
This is not reproducable, but I had two windows, in oneI had the home page, in the
other I had the thread monitor, when I tried to run the "Show All Pages"
applet and got the following crash:
: laoshu; less tip.960808
...
ProcStackSize = 24576
malloc_chunk = 147480
initial_malloc_chunk = 1179840
Objects: 3414032 bytes, Handles: 426760 bytes, GCbits: 120028 bytes
Hullo from sun.kona.Main.main!
Env.buildNamespaces(): Building Namespaces!
Env.buildNamespaces(): Built Namespaces!
Kona Version: 1.0.kI
...
sysUnmapMem 0xe0304570
sysUnmapMem 0xe0304570
Fault 0x9 at PC 0xf8045f20
AFSR 1860000
AFAR 71d10000
MFSR 18000b0
MFAR 24231ac
SFSR 326
SFAR 10
Aborting...
I have the tip log and binary of kona in ~madany/BUGS/*.960808
-peter madany 8/8/96
I am entering this bug as something I hope we can solve mostly by insipration
and not persperation. FTP has encountered one of ugliest bugs I have
ever seen. They have a multi-threaded application an agent manager. It
uses at times over a hundred threads. It has run remarkably well. However,
during stress testing of the application, the Virtual machine will crash.
The circumstances are brutal. It runs anywhere from 1-10 hours and then
the crash occurs....sometimes. I have brought their application in house
and have reproduced the problem. Unfortunately, their application is a
very complicated to run and it took about a day a half to get things set up.
Attempts to narrow this down to a small test case have not been fruitful.
Through some trial and error ftp and I were able to get the time
between crashes up to 40 hours. This allowed them to ship so the
urgency is not as great. However, the bug still exists. I have
included several stack traces from the point of failure. In addition,
I have included some notes Tim Lindholm sent me after having a hallway
conversation after a look at the code. After talking with Tim, I have
some code changes I am going to make and a possible work around for the
customer. I creating this rambling note and placing it in bugtraq so
we will have a record of the problem in case anybody stumble on it.
Tim will have to confirm that these thoughts are still relevant after
looking at the included stack traces....
To: ###@###.###, thomas.kincaid@East
Cc: ###@###.###, Jim.Graham@Eng
Subject: Notes on FTP crash
Date: Wed, 17 Jul 1996 19:32:49 -0700
Hi Tom, here are a few half-baked notes mainly to remind me about a
possible cause of the bug. If you report the bug in the customer's name
and put me on as its owner, I'll add these details to the evaluation.
In the Green threads implementation of sysThreadExit() there is the
code:
...
removeFromActiveQ(tid);
/*
* WARNING: DO NOT ATTEMPT TO ENTER ANY MONITORS AFTER THIS POINT
* (unless you are reentering something you are *guaranteed* to
* already own).
*/
SCHED_LOCK();
...
thread_handle = (TID) tid->cookie;
ee = (ExecEnv *) THREAD(thread_handle)->eetop; <<< site of crash?
...
The call to removeFromActiveQ removes an exiting thread from the active
thread queue (whatever that is called). The active thread queue is
known about specially by the garbage collector, and ensures that a
thread object will not go away whether or not a user drops the last
reference to it, at least until the thread is stopped or exits on its
own.
I think there is a race condition at least between the removeFromActiveQ
call and the SCHED_LOCK. If this thread's exit is interrupted at that
point and a garbage collection happens, the exiting thread may be
collected: the user may have dropped all references, it's not on the
active thread queue, and the thread_handle variable on the C stack
won't have been set yet. I think that the same race exists on the
PC and on native Solaris threads, and is somewhat more likely to be hit
there given their's time-slicing.
The customer could work around the bug by keeping references to threads
in Java until Thread.isAlive fails (meaning sysThreadExit has got past
the race). To fix the bug we might be able to lock the thread queue
around removal and setting of thread_handle, or reorder setting of
thread_handle and removal from the queue.
-- Tim
Stack traces
------------------------------------
With -debug set:
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) where
[1] kill(0x0, 0x6, 0x0, 0xef691448, 0x18, 0xef638ed4), at 0xef6740d8
[2] abort(0xf5d22f34, 0xf0279b08, 0x400100, 0x0, 0x0, 0x0), at 0xef63a530
[3] sysAbort(), line 164 in "system_md.c"
=>[4] scanThread(t = 0xedba0de0, arg = 0xee308888), line 1048 in "gc.c"
[5] sysThreadEnumerateOver(func = &`java_g`gc.c`scanThread(struct sys_thread *t, void *arg) at 0x3d760, arg = 0xee308888), line 606 in "threads_md.c"
[6] scanThreads(), line 1177 in "gc.c"
[7] gc0_locked(async_call = 0, free_space_goal = 1256U), line 1767 in "gc.c"
[8] gc0(async_call = 0, free_space_goal = 1256U), line 1852 in "gc.c"
[9] manageAllocFailure(n = 1256, overflow_type = 1, overflow_act = 1), line 840 in "gc.c"
[10] realObjAlloc(mptr = 0x4de5, n0 = 1246), line 296 in "gc.c"
[11] ArrayAlloc(t = 5, l = 623), line 429 in "gc.c"
[12] ExecuteJava(initial_pc = 0xede70086 "\\x30000057", ee = 0xede70ccc), line 2579 in "interpreter.c"
[13] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee3120e8, method_name = 0xba7f3 "<init>", method_signature = 0xbaa63 "([C)V", mb = 0xbad88, isStaticCall = FALSE, args = 0xede702b0, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[14] execute_java_constructor(ee = 0xede70ccc, classname = 0xba8e5 "java/lang/String", cb = 0xb42b0, signature = 0xa8194 "([C)", ...), line 345 in "interpreter.c"
[15] makeJavaString(str = 0x2e2eb8 " Receiving agent...\\nLauncher is busy\\nmessage:CyberAgent Manager,log saved,###@###.###,Jun 22 18:20:27 EDT 1996, \\n Agent [SystemAgent] sent\\n Agent [SystemAgent] received and accepted\\nSTART SAVING AT 1\\nNumb of outgoing agents = 1\\nsending: sampleAgent null to ###@###.###,\\n Sending agent [sampleAgent]\\nNumber of Received entries = 1\\nnumber of AgentCommInterface = 1\\nAgent Thread: SystemAgent\\nAgent Type: report\\nNumb of running agents = 1\\nNumber of log entries = 1\\nDelivering report" ..., len = 623), line 40 in "string.c"
[16] sun_awt_motif_MTextAreaPeer_getText(this = 0xee300328), line 347 in "awt_TextArea.c"
[17] Java_sun_awt_motif_MTextAreaPeer_getText_stub(_P_ = 0x1f7730, _EE_ = 0xede70ccc), line 882 in "libstubs.c"
[18] invokeNativeMethod(o = 0xee300328, mb = 0x1eb3a8, args_size = 1, ee = 0xede70ccc), line 194 in "classruntime.c"
[19] ExecuteJava(initial_pc = 0xede70ba6 "\\x3000005a", ee = 0xede70ccc), line 2434 in "interpreter.c"
[20] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", method_signature = 0xa8280 "()V", mb = (nil), isStaticCall = FALSE, args = 0xede70cac, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[21] execute_java_dynamic_method(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", signature = 0xa8280 "()V", ...), line 224 in "interpreter.c"
[22] ThreadRT0(p = 0xee308888), line 61 in "thread.c"
[23] start_func(dummy = 1, arg = -298809208, func = &`java_g`thread.c`ThreadRT0(Hjava_lang_Thread *p) at 0x65500), line 195 in "threads_md.c"
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) list
1058 }
1059 #endif /* TRACEGC */
1060 /* Mark thread object */
1061 MarkPtr(p, HardMark);
1062
1063 base = threadStackBase((TID)p);
1064 ssc = threadStackPointer((TID)p);
1065
1066 if (p == me) {
1067 ssc = (unsigned char **) &me;
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) where
[1] kill(0x0, 0x6, 0x0, 0xef691448, 0x18, 0xef638ed4), at 0xef6740d8
[2] abort(0xf5d22f34, 0xf0279b08, 0x400100, 0x0, 0x0, 0x0), at 0xef63a530
[3] sysAbort(), line 164 in "system_md.c"
=>[4] scanThread(t = 0xedba0de0, arg = 0xee308888), line 1048 in "gc.c"
[5] sysThreadEnumerateOver(func = &`java_g`gc.c`scanThread(struct sys_thread *t, void *arg) at 0x3d760, arg = 0xee308888), line 606 in "threads_md.c"
[6] scanThreads(), line 1177 in "gc.c"
[7] gc0_locked(async_call = 0, free_space_goal = 1256U), line 1767 in "gc.c"
[8] gc0(async_call = 0, free_space_goal = 1256U), line 1852 in "gc.c"
[9] manageAllocFailure(n = 1256, overflow_type = 1, overflow_act = 1), line 840 in "gc.c"
[10] realObjAlloc(mptr = 0x4de5, n0 = 1246), line 296 in "gc.c"
[11] ArrayAlloc(t = 5, l = 623), line 429 in "gc.c"
[12] ExecuteJava(initial_pc = 0xede70086 "\\x30000057", ee = 0xede70ccc), line 2579 in "interpreter.c"
[13] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee3120e8, method_name = 0xba7f3 "<init>", method_signature = 0xbaa63 "([C)V", mb = 0xbad88, isStaticCall = FALSE, args = 0xede702b0, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[14] execute_java_constructor(ee = 0xede70ccc, classname = 0xba8e5 "java/lang/String", cb = 0xb42b0, signature = 0xa8194 "([C)", ...), line 345 in "interpreter.c"
[15] makeJavaString(str = 0x2e2eb8 " Receiving agent...\\nLauncher is busy\\nmessage:CyberAgent Manager,log saved,###@###.###,Jun 22 18:20:27 EDT 1996, \\n Agent [SystemAgent] sent\\n Agent [SystemAgent] received and accepted\\nSTART SAVING AT 1\\nNumb of outgoing agents = 1\\nsending: sampleAgent null to ###@###.###,\\n Sending agent [sampleAgent]\\nNumber of Received entries = 1\\nnumber of AgentCommInterface = 1\\nAgent Thread: SystemAgent\\nAgent Type: report\\nNumb of running agents = 1\\nNumber of log entries = 1\\nDelivering report" ..., len = 623), line 40 in "string.c"
[16] sun_awt_motif_MTextAreaPeer_getText(this = 0xee300328), line 347 in "awt_TextArea.c"
[17] Java_sun_awt_motif_MTextAreaPeer_getText_stub(_P_ = 0x1f7730, _EE_ = 0xede70ccc), line 882 in "libstubs.c"
[18] invokeNativeMethod(o = 0xee300328, mb = 0x1eb3a8, args_size = 1, ee = 0xede70ccc), line 194 in "classruntime.c"
[19] ExecuteJava(initial_pc = 0xede70ba6 "\\x3000005a", ee = 0xede70ccc), line 2434 in "interpreter.c"
[20] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", method_signature = 0xa8280 "()V", mb = (nil), isStaticCall = FALSE, args = 0xede70cac, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[21] execute_java_dynamic_method(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", signature = 0xa8280 "()V", ...), line 224 in "interpreter.c"
[22] ThreadRT0(p = 0xee308888), line 61 in "thread.c"
[23] start_func(dummy = 1, arg = -298809208, func = &`java_g`thread.c`ThreadRT0(Hjava_lang_Thread *p) at 0x65500), line 195 in "threads_md.c"
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) f 4
f: not found
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) frame 4
Current function is scanThread
1048 sysAbort();
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) list 1040,1050
1040 JHandle *me = (JHandle *) arg;
1041 JHandle *p = (JHandle *) sysThreadGetBackPtr(t);
1042 unsigned char **ssc, **limit;
1043 stackp_t base;
1044 SetLimits();
1045
1046 #ifdef DEBUG
1047 if (!ValidHandle(p)) {
1048 sysAbort();
1049 }
1050 #endif /* DEBUG */
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) c/bin/dbx) where
> =>[1] sysThreadExit(), line 322 in "threads_md.c"
> [2] ThreadRT0(p = 0xee317ec0), line 91 in "thread.c"
> [3] start_func(dummy = 1, arg = -298746176, func = &`java_g`thread.c`ThreadRT0(Hjava_lang_Thread *p) at 0x65500), line 195 in "threads_md.c"
>
The description field as copied from bug report 1263002 follows:
Peter was logged in to my 8MB MrCoffee and trying to get JavaOke
up and running when the following crash happened. The JavaOke
window had appeared, and we tried to select some music, but nothing
was responding.
The TCP RST output indicates that the JavaOke server was not running
and that our attempts at connecting to the JavaOke server were being
rejected.
Here's the dump:
sysUnmapMem 0xe02865d0
sysUnmapMem 0xe0209ff8
sysUnmapMem 0xe02135b0
TCP: SYN_SENT got RST -- connection refused
Fault 0x9 at PC 0xf8048264
AFSR 1860000
AFAR 71d00004
MFSR 18000b0
MFAR 4201bc
SFSR 326
SFAR 10
Aborting...
Full thread dump:
"../../../../src/kona/java/green_threads/src/schedule.c", line 68: assertion _scheduling_lock == 1 failure
*** panic: "../../../../src/kona/java/green_threads/src/schedule.c", line 68: assertion failure
Double abort!
Kona Postmortem: Initialized ethernet 8:0:20:7c:ba:df
JDM1Platform
sritchie:08/05/96-16:59
SCCS/s.GNUmakefile: D 1.193 96/07/30 11:30:29 rberlin 297 296 00003/00003/00815
SCCS/s.Classfiles.gmk: D 1.144 96/07/31 15:12:45 sritchie 250 249 00000/00000/00839
Type 'go' to resume
Type help for more information
ok
PC: ffd4077c
Last leaf: jmpl f82541e0 from f801fcec
0 w %o0-%o5: ( f82541e0 ffeff000 0 71e1000c 80000000 0 )
call f8022ccc from f8023098
1 w %o0-%o5: ( f802307c 10 240 ffd37a9c ffffffff 3d486 )
call f8022d90 from f8023284
2 w %o0-%o5: ( 2a f8626380 240 a2 20 7c )
call f80230d8 from f801d968
3 w %o0-%o5: ( 0 2 f8602000 f8601800 73 0 )
call f801d8e4 from f8025b18
4 w %o0-%o5: ( 62 e01fc228 62 7efefeff 81010100 ff00 )
call f8025b14 from f801c09c
5 w %o0-%o5: ( 0 e01fc698 f8252e5a e01fc7f8 81010100 ff00 )
call f801c040 from f801d9d4
6 w %o0-%o5: ( f8252e38 e01fc698 44 0 0 e01fcbb8 )
call f801d978 from f8047a88
7 w %o0-%o5: ( f8350448 f8350410 44 b 0 0 )
call f8047a38 from f8046e58
8 w %o0-%o5: ( 0 f8046e48 0 f81f6200 e000f840 e0010f48 )
call f8046dac from f80470ec
9 w %o0-%o5: ( e03a4af0 e03a4afc d f801dd7c 800 11 )
call f80470c0 from f804888c
a w %o0-%o5: ( e03a4af0 e01fc8c8 13 7efefeff 81010100 ff00 )
call f8048878 from f800d210
b w %o0-%o5: ( 0 0 c 7efefeff 81010100 ff0000 )
call f800d1f0 from f801d944
c w %o0-%o5: ( 2f 1 f8602000 f8601800 81010100 ff0000 )
call f801d8e4 from f8027c9c
d w %o0-%o5: ( 1 f8602000 f8048264 0 0 0 )
call f8027b08 from f8021014
e w %o0-%o5: ( 9 f8048264 e020ff58 f863c400 0 e01fcbb8 )
call f8047d40 from f8048258
f w %o0-%o5: ( 0 f80a7f38 f81f6250 f81f6200 0 0 )
call f8048224 from f801ad84
10 w %o0-%o5: ( f81f6000 e000f840 f81f6250 f81f6200 e000f840 e0010f48 )
jmpl f801ac10 from f80480d4
11 w %o0-%o5: ( f81f6000 f8602400 0 c8 e02ddecc f816fda0 )
???? from f8048090
12 w %o0-%o5: ( 1 e000f840 f801ac10 0 0 0 )
XXXXXXX from 11
13 w %o0-%o5: ( e0042460 0 e0032e70 11 7 264 )
XXXXXXX from a2
14 w %o0-%o5: ( 0 f8086560 0 f8086560 e00a75d8 f80717a0 )
The full crash log is in ~sritchie/crashes/crash6. The kona binary is
~sritchie/crashes/kona.crash6.
sritchie -- 9 Aug 96
The description field as copied from bug report 1263236 follows:
Reproduability: Hard to reproduce as the applet http://physics.www.media.mit.edu/~metois/MyJava/WebMusic/choir.html
ran for 64 hours and then crashed
Kona Version: 1.0.kI
Kona Builder and Date: SMCC:08/09/96-04:05
Kona GNUMakefile SCCS: 1.196
Kona Classfiles.gmk SCCS: 1.147
Kona Platform is : JavaDesktopMachine
Kona image is l
Ctrace file:../traces/java-6.ctrace_1.0.kI.Aug12.10:15:31
ctrace
PC: ffd4077c
Last leaf: jmpl f81f5060 from f801e4e0
0 w %o0-%o5: ( f81f5060 ffeff000 0 71e1000c 80000000 0 )
call f801e4d0 from f8026294
1 w %o0-%o5: ( 0 f8602000 f8045440 f801c6a8 73 0 )
call f802610c from f801f7c8
2 w %o0-%o5: ( f8045444 f8045440 d f801c6a8 73 0 )
call f801de1c from f801da38
3 w %o0-%o5: ( 0 e02e9d58 18 7efefeff 81010100 ff0000 )
call f804541c from f800c460
4 w %o0-%o5: ( e03ac930 f81ef9a8 e03ac908 f801c6a8 690066 0 )
jmpl f800c444 from f8017ee4
5 w %o0-%o5: ( e03ac920 0 19 7efefeff 81010100 ff00 )
call f8017ecc from f800c4b4
6 w %o0-%o5: ( f800c444 0 c 7efefeff 81010100 ff0000 )
More
[<space>,<cr>,q] ?
call f800c470 from f801c278
7 w %o0-%o5: ( 21 1 f8602000 f8601800 81010100 ff0000 )
call f801c210 from f80262a0
8 w %o0-%o5: ( 1 f8602000 f8045cac 0 0 0 )
call f802610c from f801f7c8
9 w %o0-%o5: ( 9 f8045cac e02f80a0 f863c800 0 e02ea128 )
call f80457fc from f8045ca0
a w %o0-%o5: ( 0 f80ba454 f81f1730 f81f16e0 0 0 )
call f8045c6c from f8019728
b w %o0-%o5: ( 2 e0002bd8 f81f1730 f81f16e0 e0002bd8 e0030d10 )
jmpl f80195e4 from f8045b1c
c w %o0-%o5: ( f81f1400 f8602400 0 1388 e02f1564 f816ca10 )
???? from f8045ad8
d w %o0-%o5: ( 1 e0002bd8 f80195e4 0 0 0 )
ok
Trace file: ../traces/uniq/java-6.trace_1.0.kI.Aug12.10:15:31
jmpl %o0, %o7
call kona_enter_mon
call trap
call sysWrite
call sysMonitorDumpInfo
jmpl %i0, %o7
call registeredEnumerate_unlocked
call DumpMonitors_unlocked
call abort
call trap
call _sched_lock
call sysThreadExit
jmpl %i2, %o7
save %sp, -0x70, %sp
Snmp Dump: Not available
The description field as copied from bug report 1262836 follows:
This is not reproducable, but I had two windows, in oneI had the home page, in the
other I had the thread monitor, when I tried to run the "Show All Pages"
applet and got the following crash:
: laoshu; less tip.960808
...
ProcStackSize = 24576
malloc_chunk = 147480
initial_malloc_chunk = 1179840
Objects: 3414032 bytes, Handles: 426760 bytes, GCbits: 120028 bytes
Hullo from sun.kona.Main.main!
Env.buildNamespaces(): Building Namespaces!
Env.buildNamespaces(): Built Namespaces!
Kona Version: 1.0.kI
...
sysUnmapMem 0xe0304570
sysUnmapMem 0xe0304570
Fault 0x9 at PC 0xf8045f20
AFSR 1860000
AFAR 71d10000
MFSR 18000b0
MFAR 24231ac
SFSR 326
SFAR 10
Aborting...
I have the tip log and binary of kona in ~madany/BUGS/*.960808
-peter madany 8/8/96
I am entering this bug as something I hope we can solve mostly by insipration
and not persperation. FTP has encountered one of ugliest bugs I have
ever seen. They have a multi-threaded application an agent manager. It
uses at times over a hundred threads. It has run remarkably well. However,
during stress testing of the application, the Virtual machine will crash.
The circumstances are brutal. It runs anywhere from 1-10 hours and then
the crash occurs....sometimes. I have brought their application in house
and have reproduced the problem. Unfortunately, their application is a
very complicated to run and it took about a day a half to get things set up.
Attempts to narrow this down to a small test case have not been fruitful.
Through some trial and error ftp and I were able to get the time
between crashes up to 40 hours. This allowed them to ship so the
urgency is not as great. However, the bug still exists. I have
included several stack traces from the point of failure. In addition,
I have included some notes Tim Lindholm sent me after having a hallway
conversation after a look at the code. After talking with Tim, I have
some code changes I am going to make and a possible work around for the
customer. I creating this rambling note and placing it in bugtraq so
we will have a record of the problem in case anybody stumble on it.
Tim will have to confirm that these thoughts are still relevant after
looking at the included stack traces....
To: ###@###.###, thomas.kincaid@East
Cc: ###@###.###, Jim.Graham@Eng
Subject: Notes on FTP crash
Date: Wed, 17 Jul 1996 19:32:49 -0700
Hi Tom, here are a few half-baked notes mainly to remind me about a
possible cause of the bug. If you report the bug in the customer's name
and put me on as its owner, I'll add these details to the evaluation.
In the Green threads implementation of sysThreadExit() there is the
code:
...
removeFromActiveQ(tid);
/*
* WARNING: DO NOT ATTEMPT TO ENTER ANY MONITORS AFTER THIS POINT
* (unless you are reentering something you are *guaranteed* to
* already own).
*/
SCHED_LOCK();
...
thread_handle = (TID) tid->cookie;
ee = (ExecEnv *) THREAD(thread_handle)->eetop; <<< site of crash?
...
The call to removeFromActiveQ removes an exiting thread from the active
thread queue (whatever that is called). The active thread queue is
known about specially by the garbage collector, and ensures that a
thread object will not go away whether or not a user drops the last
reference to it, at least until the thread is stopped or exits on its
own.
I think there is a race condition at least between the removeFromActiveQ
call and the SCHED_LOCK. If this thread's exit is interrupted at that
point and a garbage collection happens, the exiting thread may be
collected: the user may have dropped all references, it's not on the
active thread queue, and the thread_handle variable on the C stack
won't have been set yet. I think that the same race exists on the
PC and on native Solaris threads, and is somewhat more likely to be hit
there given their's time-slicing.
The customer could work around the bug by keeping references to threads
in Java until Thread.isAlive fails (meaning sysThreadExit has got past
the race). To fix the bug we might be able to lock the thread queue
around removal and setting of thread_handle, or reorder setting of
thread_handle and removal from the queue.
-- Tim
Stack traces
------------------------------------
With -debug set:
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) where
[1] kill(0x0, 0x6, 0x0, 0xef691448, 0x18, 0xef638ed4), at 0xef6740d8
[2] abort(0xf5d22f34, 0xf0279b08, 0x400100, 0x0, 0x0, 0x0), at 0xef63a530
[3] sysAbort(), line 164 in "system_md.c"
=>[4] scanThread(t = 0xedba0de0, arg = 0xee308888), line 1048 in "gc.c"
[5] sysThreadEnumerateOver(func = &`java_g`gc.c`scanThread(struct sys_thread *t, void *arg) at 0x3d760, arg = 0xee308888), line 606 in "threads_md.c"
[6] scanThreads(), line 1177 in "gc.c"
[7] gc0_locked(async_call = 0, free_space_goal = 1256U), line 1767 in "gc.c"
[8] gc0(async_call = 0, free_space_goal = 1256U), line 1852 in "gc.c"
[9] manageAllocFailure(n = 1256, overflow_type = 1, overflow_act = 1), line 840 in "gc.c"
[10] realObjAlloc(mptr = 0x4de5, n0 = 1246), line 296 in "gc.c"
[11] ArrayAlloc(t = 5, l = 623), line 429 in "gc.c"
[12] ExecuteJava(initial_pc = 0xede70086 "\\x30000057", ee = 0xede70ccc), line 2579 in "interpreter.c"
[13] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee3120e8, method_name = 0xba7f3 "<init>", method_signature = 0xbaa63 "([C)V", mb = 0xbad88, isStaticCall = FALSE, args = 0xede702b0, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[14] execute_java_constructor(ee = 0xede70ccc, classname = 0xba8e5 "java/lang/String", cb = 0xb42b0, signature = 0xa8194 "([C)", ...), line 345 in "interpreter.c"
[15] makeJavaString(str = 0x2e2eb8 " Receiving agent...\\nLauncher is busy\\nmessage:CyberAgent Manager,log saved,###@###.###,Jun 22 18:20:27 EDT 1996, \\n Agent [SystemAgent] sent\\n Agent [SystemAgent] received and accepted\\nSTART SAVING AT 1\\nNumb of outgoing agents = 1\\nsending: sampleAgent null to ###@###.###,\\n Sending agent [sampleAgent]\\nNumber of Received entries = 1\\nnumber of AgentCommInterface = 1\\nAgent Thread: SystemAgent\\nAgent Type: report\\nNumb of running agents = 1\\nNumber of log entries = 1\\nDelivering report" ..., len = 623), line 40 in "string.c"
[16] sun_awt_motif_MTextAreaPeer_getText(this = 0xee300328), line 347 in "awt_TextArea.c"
[17] Java_sun_awt_motif_MTextAreaPeer_getText_stub(_P_ = 0x1f7730, _EE_ = 0xede70ccc), line 882 in "libstubs.c"
[18] invokeNativeMethod(o = 0xee300328, mb = 0x1eb3a8, args_size = 1, ee = 0xede70ccc), line 194 in "classruntime.c"
[19] ExecuteJava(initial_pc = 0xede70ba6 "\\x3000005a", ee = 0xede70ccc), line 2434 in "interpreter.c"
[20] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", method_signature = 0xa8280 "()V", mb = (nil), isStaticCall = FALSE, args = 0xede70cac, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[21] execute_java_dynamic_method(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", signature = 0xa8280 "()V", ...), line 224 in "interpreter.c"
[22] ThreadRT0(p = 0xee308888), line 61 in "thread.c"
[23] start_func(dummy = 1, arg = -298809208, func = &`java_g`thread.c`ThreadRT0(Hjava_lang_Thread *p) at 0x65500), line 195 in "threads_md.c"
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) list
1058 }
1059 #endif /* TRACEGC */
1060 /* Mark thread object */
1061 MarkPtr(p, HardMark);
1062
1063 base = threadStackBase((TID)p);
1064 ssc = threadStackPointer((TID)p);
1065
1066 if (p == me) {
1067 ssc = (unsigned char **) &me;
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) where
[1] kill(0x0, 0x6, 0x0, 0xef691448, 0x18, 0xef638ed4), at 0xef6740d8
[2] abort(0xf5d22f34, 0xf0279b08, 0x400100, 0x0, 0x0, 0x0), at 0xef63a530
[3] sysAbort(), line 164 in "system_md.c"
=>[4] scanThread(t = 0xedba0de0, arg = 0xee308888), line 1048 in "gc.c"
[5] sysThreadEnumerateOver(func = &`java_g`gc.c`scanThread(struct sys_thread *t, void *arg) at 0x3d760, arg = 0xee308888), line 606 in "threads_md.c"
[6] scanThreads(), line 1177 in "gc.c"
[7] gc0_locked(async_call = 0, free_space_goal = 1256U), line 1767 in "gc.c"
[8] gc0(async_call = 0, free_space_goal = 1256U), line 1852 in "gc.c"
[9] manageAllocFailure(n = 1256, overflow_type = 1, overflow_act = 1), line 840 in "gc.c"
[10] realObjAlloc(mptr = 0x4de5, n0 = 1246), line 296 in "gc.c"
[11] ArrayAlloc(t = 5, l = 623), line 429 in "gc.c"
[12] ExecuteJava(initial_pc = 0xede70086 "\\x30000057", ee = 0xede70ccc), line 2579 in "interpreter.c"
[13] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee3120e8, method_name = 0xba7f3 "<init>", method_signature = 0xbaa63 "([C)V", mb = 0xbad88, isStaticCall = FALSE, args = 0xede702b0, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[14] execute_java_constructor(ee = 0xede70ccc, classname = 0xba8e5 "java/lang/String", cb = 0xb42b0, signature = 0xa8194 "([C)", ...), line 345 in "interpreter.c"
[15] makeJavaString(str = 0x2e2eb8 " Receiving agent...\\nLauncher is busy\\nmessage:CyberAgent Manager,log saved,###@###.###,Jun 22 18:20:27 EDT 1996, \\n Agent [SystemAgent] sent\\n Agent [SystemAgent] received and accepted\\nSTART SAVING AT 1\\nNumb of outgoing agents = 1\\nsending: sampleAgent null to ###@###.###,\\n Sending agent [sampleAgent]\\nNumber of Received entries = 1\\nnumber of AgentCommInterface = 1\\nAgent Thread: SystemAgent\\nAgent Type: report\\nNumb of running agents = 1\\nNumber of log entries = 1\\nDelivering report" ..., len = 623), line 40 in "string.c"
[16] sun_awt_motif_MTextAreaPeer_getText(this = 0xee300328), line 347 in "awt_TextArea.c"
[17] Java_sun_awt_motif_MTextAreaPeer_getText_stub(_P_ = 0x1f7730, _EE_ = 0xede70ccc), line 882 in "libstubs.c"
[18] invokeNativeMethod(o = 0xee300328, mb = 0x1eb3a8, args_size = 1, ee = 0xede70ccc), line 194 in "classruntime.c"
[19] ExecuteJava(initial_pc = 0xede70ba6 "\\x3000005a", ee = 0xede70ccc), line 2434 in "interpreter.c"
[20] do_execute_java_method_vararg(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", method_signature = 0xa8280 "()V", mb = (nil), isStaticCall = FALSE, args = 0xede70cac, otherBits = (nil), shortFloats = FALSE), line 529 in "interpreter.c"
[21] execute_java_dynamic_method(ee = 0xede70ccc, obj = 0xee308888, method_name = 0xa827c "run", signature = 0xa8280 "()V", ...), line 224 in "interpreter.c"
[22] ThreadRT0(p = 0xee308888), line 61 in "thread.c"
[23] start_func(dummy = 1, arg = -298809208, func = &`java_g`thread.c`ThreadRT0(Hjava_lang_Thread *p) at 0x65500), line 195 in "threads_md.c"
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) f 4
f: not found
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) frame 4
Current function is scanThread
1048 sysAbort();
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) list 1040,1050
1040 JHandle *me = (JHandle *) arg;
1041 JHandle *p = (JHandle *) sysThreadGetBackPtr(t);
1042 unsigned char **ssc, **limit;
1043 stackp_t base;
1044 SetLimits();
1045
1046 #ifdef DEBUG
1047 if (!ValidHandle(p)) {
1048 sysAbort();
1049 }
1050 #endif /* DEBUG */
(/usr/dist/local/pkgs/sunpro/5.x-sparc/bin/dbx) c/bin/dbx) where
> =>[1] sysThreadExit(), line 322 in "threads_md.c"
> [2] ThreadRT0(p = 0xee317ec0), line 91 in "thread.c"
> [3] start_func(dummy = 1, arg = -298746176, func = &`java_g`thread.c`ThreadRT0(Hjava_lang_Thread *p) at 0x65500), line 195 in "threads_md.c"
>
The description field as copied from bug report 1263002 follows:
Peter was logged in to my 8MB MrCoffee and trying to get JavaOke
up and running when the following crash happened. The JavaOke
window had appeared, and we tried to select some music, but nothing
was responding.
The TCP RST output indicates that the JavaOke server was not running
and that our attempts at connecting to the JavaOke server were being
rejected.
Here's the dump:
sysUnmapMem 0xe02865d0
sysUnmapMem 0xe0209ff8
sysUnmapMem 0xe02135b0
TCP: SYN_SENT got RST -- connection refused
Fault 0x9 at PC 0xf8048264
AFSR 1860000
AFAR 71d00004
MFSR 18000b0
MFAR 4201bc
SFSR 326
SFAR 10
Aborting...
Full thread dump:
"../../../../src/kona/java/green_threads/src/schedule.c", line 68: assertion _scheduling_lock == 1 failure
*** panic: "../../../../src/kona/java/green_threads/src/schedule.c", line 68: assertion failure
Double abort!
Kona Postmortem: Initialized ethernet 8:0:20:7c:ba:df
JDM1Platform
sritchie:08/05/96-16:59
SCCS/s.GNUmakefile: D 1.193 96/07/30 11:30:29 rberlin 297 296 00003/00003/00815
SCCS/s.Classfiles.gmk: D 1.144 96/07/31 15:12:45 sritchie 250 249 00000/00000/00839
Type 'go' to resume
Type help for more information
ok
PC: ffd4077c
Last leaf: jmpl f82541e0 from f801fcec
0 w %o0-%o5: ( f82541e0 ffeff000 0 71e1000c 80000000 0 )
call f8022ccc from f8023098
1 w %o0-%o5: ( f802307c 10 240 ffd37a9c ffffffff 3d486 )
call f8022d90 from f8023284
2 w %o0-%o5: ( 2a f8626380 240 a2 20 7c )
call f80230d8 from f801d968
3 w %o0-%o5: ( 0 2 f8602000 f8601800 73 0 )
call f801d8e4 from f8025b18
4 w %o0-%o5: ( 62 e01fc228 62 7efefeff 81010100 ff00 )
call f8025b14 from f801c09c
5 w %o0-%o5: ( 0 e01fc698 f8252e5a e01fc7f8 81010100 ff00 )
call f801c040 from f801d9d4
6 w %o0-%o5: ( f8252e38 e01fc698 44 0 0 e01fcbb8 )
call f801d978 from f8047a88
7 w %o0-%o5: ( f8350448 f8350410 44 b 0 0 )
call f8047a38 from f8046e58
8 w %o0-%o5: ( 0 f8046e48 0 f81f6200 e000f840 e0010f48 )
call f8046dac from f80470ec
9 w %o0-%o5: ( e03a4af0 e03a4afc d f801dd7c 800 11 )
call f80470c0 from f804888c
a w %o0-%o5: ( e03a4af0 e01fc8c8 13 7efefeff 81010100 ff00 )
call f8048878 from f800d210
b w %o0-%o5: ( 0 0 c 7efefeff 81010100 ff0000 )
call f800d1f0 from f801d944
c w %o0-%o5: ( 2f 1 f8602000 f8601800 81010100 ff0000 )
call f801d8e4 from f8027c9c
d w %o0-%o5: ( 1 f8602000 f8048264 0 0 0 )
call f8027b08 from f8021014
e w %o0-%o5: ( 9 f8048264 e020ff58 f863c400 0 e01fcbb8 )
call f8047d40 from f8048258
f w %o0-%o5: ( 0 f80a7f38 f81f6250 f81f6200 0 0 )
call f8048224 from f801ad84
10 w %o0-%o5: ( f81f6000 e000f840 f81f6250 f81f6200 e000f840 e0010f48 )
jmpl f801ac10 from f80480d4
11 w %o0-%o5: ( f81f6000 f8602400 0 c8 e02ddecc f816fda0 )
???? from f8048090
12 w %o0-%o5: ( 1 e000f840 f801ac10 0 0 0 )
XXXXXXX from 11
13 w %o0-%o5: ( e0042460 0 e0032e70 11 7 264 )
XXXXXXX from a2
14 w %o0-%o5: ( 0 f8086560 0 f8086560 e00a75d8 f80717a0 )
The full crash log is in ~sritchie/crashes/crash6. The kona binary is
~sritchie/crashes/kona.crash6.
sritchie -- 9 Aug 96
The description field as copied from bug report 1263236 follows:
Reproduability: Hard to reproduce as the applet http://physics.www.media.mit.edu/~metois/MyJava/WebMusic/choir.html
ran for 64 hours and then crashed
Kona Version: 1.0.kI
Kona Builder and Date: SMCC:08/09/96-04:05
Kona GNUMakefile SCCS: 1.196
Kona Classfiles.gmk SCCS: 1.147
Kona Platform is : JavaDesktopMachine
Kona image is l