Loading...

Type: Bug
Resolution: Duplicate
Priority: P1
Fix Version/s: None
Affects Version/s: unknown
Component/s: hotspot
Labels:

Subcomponent:
runtime
CPU:

sparc
OS:

solaris_2.5.1

Notes from customer:

**********************
**********************

Greetings:

We have discovered a significant flaw within the Java Virtual Machine
(VM), and request your assistance. The particulars:

    * Sun JDK release 1.1.3.
    * Solaris 2.5.1, UltraSparc Creator.
    * 128 MB RAM, 600MB swap.
    * VM arguments: -mx350m -ms25m -verbosegc

Our project is a mid- to large-size application (just over 200
classes) that models a network of our central office ATM switches. The
software is 100% Java, and is best classified as a "long-running,
system-software" type application. Within the simulator, it is not
uncommon to have many thousands of extant objects, representing
network elements, interface cards, connections, etc.

The simulator is in beta release, and we have begun some "stress
testing" that will mimic a reasonably-loaded ATM switch. Yesterday, we
were loading a "switch" with approximately 5000 connections when the
VM failed with a SIGSEGV. The loading process was 75% completed. We
noticed that, at the point of failure, the garbage collector had just
begun executing. We initiated a second run of the simulator; this time
we got slightly further before the VM failed, same cause. We examined
the core file with dbx; most of the stack had been hosed, but we did
get a partial trace. What follows is the data provided by dbx, plus a
snapshot of the VM garbage collector diagnostics prior to and at the
point of VM failure:

----------- cut here --------------------

********** dbx output ********************

Mr. Peabody: dbx /usr/local/java/JDK/bin/sparc/green_threads/java core.08211706
Reading symbolic information for /usr/local/java/JDK/bin/sparc/green_threads/java
core file header read successfully
Reading symbolic information for rtld /usr/lib/ld.so.1
Reading symbolic information for /usr/local/java/JDK/lib/sparc/green_threads/libjava.so
Reading symbolic information for /opt/SUNWspro/lib/libm.so.1
Reading symbolic information for /usr/lib/libdl.so.1
Reading symbolic information for /usr/openwin/lib/libX11.so.4
Reading symbolic information for /usr/lib/libsocket.so.1
Reading symbolic information for /usr/lib/libnsl.so.1
Reading symbolic information for /usr/lib/libc.so.1
Reading symbolic information for /usr/openwin/lib/libXext.so.0
Reading symbolic information for /usr/lib/libw.so.1
Reading symbolic information for /usr/lib/libintl.so.1
Reading symbolic information for /usr/lib/libmp.so.1
Reading symbolic information for /usr/platform/SUNW,Ultra-1/lib/libc_psr.so.1
Reading symbolic information for /usr/local/java/JDK/lib/sparc/green_threads/libnet.so
Reading symbolic information for /usr/lib/nss_nis.so.1
program terminated by signal SEGV (Segmentation Fault)
dbx: core file read error: address 0xd93a1ddc not in data space
dbx: core file read error: address 0xd93a1ddc not in data space
dbx: attempt to read stack failed - bad frame pointer
(dbx 1) where
=>[1] markChildren(0xdcfbb998, 0xdd9fb780, 0x4898, 0xd8800000, 0x3, 0x1e), at 0xef6dbed8
dbx: core file read error: address 0xd93a1ddc not in data space
dbx: attempt to read stack failed - bad frame pointer
(dbx 2)

************* last successful GC invocation **************

<GC: managing allocation failure. need 840 bytes, type=1, action=1>
<GC: freed 207271 objects, 7968160 bytes in 3130 ms, 24% free (11083904/45481976
)>
  <GC: init&scan: 6 ms, scan handles: 2889 ms, sweep: 235 ms, compact: 0 ms>
<GC: managing allocation failure. need 840 bytes, type=1, action=2>
<GC: managing allocation failure. need 840 bytes, type=1, action=3>
<GC: managing allocation failure. need 840 bytes, type=1, action=4>
<GC: expanded object space by 385024 to 45867000 bytes, 25% free>

************ point of VM failure *******************

<GC: managing allocation failure. need 840 bytes, type=1, action=1>
Segmentation Fault (core dumped)

-------------- cut here -------------------

This is all the concrete data we have, other than simulator logs and
the core file. We are suspicious, however, that something is amiss
within the garbage collector. We base this supposition on data gathered
during close monitoring of the GC output during the two simulator
runs.

It *appears* that, at some point, the GC becomes misinformed
about the amount of memory actually available and begins unnecessarily
requesting new chunks. Refering to the "last successful GC invocation"
above, you'll notice the GC was searching for 840 bytes and couldn't
find them. Consequently, he began his collection process and freed
7968160 bytes. Now, looking at the line that indicates free memory
(24% free), it seems to us that, prior to the GC's sweep he had
11083904 - 7968160 = 3115744 bytes available. Could the GC heap be so
fragmented that it wasn't possible to allocate an 840 byte chunk?

An important point here. During both simulator runs, the OS had
*plenty* of virtual memory available; a shortage of swap space is not
an issue here.

Please note that this analysis is *very* rough work; we are unfamiliar
(other than surface details) with the GC's actual algorithm and may be
drawing incorrect conclusions.

We suspect it will be difficult to provide you with an isolated code
fragment that manifests the VM failure. Small test applications that
simply create a slew of objects run with no difficulty. However, we're
willing to provide you with as much data as required to solve this
problem (assuming non-disclosure).

A final note. As previously mentioned, our product is 100% Java; there
is *no* native code involved, other than what exists within and is
required by the VM and JDK.

Regards,

--

-------------------------------------------------------------
David E. Young
Fujitsu Network Communications "I claim not to have controlled
(###@###.###) events, but confess plainly
                                 that events have controlled me."
                                  -- Abraham Lincoln (1864)
"Apology is Policy"

duplicates

JDK-4034641 the java virtual machine throws a Segmentation Fault. The probable cause is a bu

Closed

Details

Description

Attachments

Issue Links

Activity

People

Dates