JDK-7059899

Stack overflows in Java code cause 64-bit JVMs to exit due to SIGSEGV

    • b16
    • x86
    • linux_redhat_5.0
    • Verified

        FULL PRODUCT VERSION :
        java version "1.6.0_24"
        Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
        Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

        FULL OS VERSION :
        Linux 2.6.18-194.17.4.el5 #1 SMP Wed Oct 20 13:03:08 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

        EXTRA RELEVANT SYSTEM CONFIGURATION :
        Java arguments/parameters: -d64 -server -Xms1024m -Xmx1024m -Xss320k -XX:MaxPermSize=384m

        A DESCRIPTION OF THE PROBLEM :
        As our clients have migrated to 64-bit JVMs, we've seen a significant increase in JVM crashes due to SIGSEGV on Unix platforms. The crashes are spontaneous and do not produce a HotSpot crash report. Each crash involved an application bug that caused deep recursion and should have resulted in a java.lang.StackOverflowError, for instance an infinite Struts forward or a search continuance/referral loop during LDAP authentication. This appears to affect both Solaris and Linux. We have only investigated further on Linux, but in both cases a 64-bit JVM was the common factor.
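
For reference, the expected behaviour for this class of application bug can be sketched as follows. This is a minimal illustration, not a reproducer for the crash: the class and method names are hypothetical, the recursion makes no native calls, and the requested stack size is only a hint that the platform may round or ignore.

```java
public class StackOverflowDemo {
    private static int depth;

    // Unbounded recursion standing in for the application bug
    // (e.g. two methods that keep calling each other).
    private static void recurse() {
        depth++;
        recurse();
    }

    /**
     * Runs the recursion on a thread created with the requested stack size
     * and returns the depth reached when StackOverflowError was caught,
     * or -1 if no StackOverflowError was caught.
     */
    public static int depthAtOverflow(long stackBytes) throws InterruptedException {
        final int[] result = { -1 };
        Thread t = new Thread(null, () -> {
            depth = 0;
            try {
                recurse();
            } catch (StackOverflowError expected) {
                // The JVM is expected to end up here instead of exiting on SIGSEGV.
                result[0] = depth;
            }
        }, "overflow-demo", stackBytes);
        t.start();
        t.join();
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // 320k mirrors the -Xss320k setting from this report.
        int reached = depthAtOverflow(320 * 1024);
        System.out.println("StackOverflowError caught at depth " + reached);
    }
}
```

In the crashes described in this report, the overflow instead occurs while a thread is inside a native method, which this sketch does not exercise.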

        We referred to the Java SE Troubleshooting Guide, section '4.1.3 Crash due to Stack Overflow', and found that the StackShadowPages value is likely too small for these platforms. The guide discusses custom JNI libraries, but we're seeing these crashes in 'normal' native calls, usually socket operations such as reads or writes. That should not be the case, which led us to investigate further. According to the OpenJDK source, the default on x86 platforms is 3, doubled to 6 on AMD64. There is a separate Solaris value (20) on AMD64, seemingly to accommodate C++ compiler bugs on that platform, but our experience suggests that this value may be more broadly applicable:

        http://hg.openjdk.java.net/jdk6/jdk6/hotspot/file/9b013e207574/src/cpu/x86/vm/globals_x86.hpp

            #ifdef AMD64
            // Very large C++ stack frames using solaris-amd64 optimized builds
            // due to lack of optimization caused by C++ compiler bugs
            define_pd_global(intx, StackShadowPages, SOLARIS_ONLY(20) NOT_SOLARIS(6) DEBUG_ONLY(+2));
            #else
            define_pd_global(intx, StackShadowPages, 3 DEBUG_ONLY(+5));
            #endif // AMD64

        Lab testing indicates that 17 is the smallest StackShadowPages value that prevents the JVM from crashing with a segmentation fault. We have not confirmed the value on Solaris (UltraSPARC; we don't support our product on Solaris x86), but we have certainly seen these conditions affect both platforms. We can only conclude that either 64-bit native stack frames on AMD64 are generally far larger than their 32-bit equivalents, or there's a problem with the way the shadow zone size is calculated (I believe it's OS page size * StackShadowPages). Either would allow previously benign stack overflows in Java code to crash the JVM.
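
To put those page counts in perspective, the arithmetic can be sketched as below. It assumes the formula guessed at above (page size * StackShadowPages) and a 4096-byte page, which is typical for x86_64 Linux but was not verified for the affected hosts. The class and method names are hypothetical.

```java
public class ShadowZoneSize {
    // Shadow zone size under the assumed formula: pages * OS page size.
    public static long shadowBytes(int pages, long pageSizeBytes) {
        return pages * pageSizeBytes;
    }

    public static void main(String[] args) {
        long pageSize = 4096;        // assumption: 4 KB pages
        long threadStack = 320 * 1024; // from -Xss320k in this report

        // 3 = 32-bit x86 default, 6 = AMD64 non-Solaris default,
        // 17 = smallest value that survived lab testing, 20 = Solaris AMD64 default.
        for (int pages : new int[] { 3, 6, 17, 20 }) {
            long bytes = shadowBytes(pages, pageSize);
            System.out.printf("StackShadowPages=%d -> %d KB (%.1f%% of the thread stack)%n",
                    pages, bytes / 1024, 100.0 * bytes / threadStack);
        }
    }
}
```

Under these assumptions, raising the value from 6 to 20 reserves a noticeably larger share of a 320k thread stack for native code, which reduces the depth available to Java frames before a StackOverflowError is raised.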

        Others have also encountered 64-bit-specific SIGSEGVs, and bug 6346701 seems to report exactly this kind of condition. However, I could not find any discussion suggesting a problem with the default shipping value, or with the calculation of the number of pages to look ahead before invoking native methods:

        http://confluence.atlassian.com/display/GHKB/JIRA+with+GreenHopper+Crashes+Java+with+a+SIGSEGV+Fault+on+Linux+64bit+JVMs
        http://fusesource.com/forums/thread.jspa?messageID=7830

        We isolated the problem using core dumps: we identified the offending threads with gdb and then used jstack to generate thread dumps:
         
        -- Crash 1 - Two application methods that call each other recursively, executing database statements, causing overflow during Oracle thin driver socket read:

        Thread 18073: (state = IN_NATIVE)
        - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
        - java.net.SocketInputStream.read(byte[], int, int) @bci=84, line=129 (Compiled frame)
        - oracle.net.ns.Packet.receive() @bci=31, line=240 (Compiled frame)
        - oracle.net.ns.DataPacket.receive() @bci=1, line=92 (Compiled frame)
        - oracle.net.ns.NetInputStream.getNextPacket() @bci=48, line=172 (Compiled frame)
        - oracle.net.ns.NetInputStream.read(byte[], int, int) @bci=33, line=117 (Compiled frame)
        - oracle.net.ns.NetInputStream.read(byte[]) @bci=5, line=92 (Compiled frame)
        - oracle.jdbc.driver.T4CMAREngine.buffer2Value(byte) @bci=325, line=2320 (Compiled frame)
        - oracle.jdbc.driver.T4CMAREngine.unmarshalUB4() @bci=2, line=1200 (Compiled frame)
        - oracle.jdbc.driver.T4CTTIoer.unmarshal() @bci=200, line=270 (Compiled frame)
        - oracle.jdbc.driver.T4C8Oall.receive() @bci=1507, line=1015 (Compiled frame)
        - oracle.jdbc.driver.T4CPreparedStatement.doOall8(boolean, boolean, boolean, boolean) @bci=655, line=194 (Compiled frame)
        - oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe() @bci=39, line=791 (Compiled frame)
        - oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe() @bci=104, line=866 (Compiled frame)
        - oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout() @bci=139, line=1186 (Compiled frame)
        - oracle.jdbc.driver.OraclePreparedStatement.executeInternal() @bci=98, line=3387 (Compiled frame)
        - oracle.jdbc.driver.OraclePreparedStatement.executeQuery() @bci=13, line=3431 (Compiled frame)
        - oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery() @bci=4, line=1491 (Compiled frame)
        - org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery() @bci=9, line=93 (Compiled frame)
        - org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery() @bci=9, line=93 (Compiled frame)
        ...

        -- Crash 2 - Infinite LDAP search referral/continuance due to incorrectly configured Active Directory server, overflowing during socket write:

        Thread 11962: (state = IN_NATIVE)
         - java.net.SocketOutputStream.socketWrite0(java.io.FileDescriptor, byte[], int, int) @bci=0 (Interpreted frame)
         - java.net.SocketOutputStream.socketWrite(byte[], int, int) @bci=44, line=92 (Interpreted frame)
         - java.net.SocketOutputStream.write(byte[], int, int) @bci=4, line=136 (Interpreted frame)
         - java.io.BufferedOutputStream.flushBuffer() @bci=20, line=65 (Interpreted frame)
         - java.io.BufferedOutputStream.flush() @bci=1, line=123 (Interpreted frame)
         - com.sun.jndi.ldap.Connection.writeRequest(com.sun.jndi.ldap.BerEncoder, int, boolean) @bci=73, line=396 (Interpreted frame)
         - com.sun.jndi.ldap.LdapClient.ldapBind(java.lang.String, byte[], javax.naming.ldap.Control[], java.lang.String, boolean) @bci=196, line=334 (Interpreted frame)
         - com.sun.jndi.ldap.LdapClient.authenticate(boolean, java.lang.String, java.lang.Object, int, java.lang.String, javax.naming.ldap.Control[], java.util.Hashtable) @bci=315, line=192 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtx.connect(boolean) @bci=316, line=2694 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtx.<init>(java.lang.String, java.lang.String, int, java.util.Hashtable, boolean) @bci=390, line=293 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtxFactory.getUsingURL(java.lang.String, java.util.Hashtable) @bci=227, line=175 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtxFactory.getLdapCtxInstance(java.lang.Object, java.util.Hashtable) @bci=12, line=134 (Interpreted frame)
         - com.sun.jndi.url.ldap.ldapURLContextFactory.getObjectInstance(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=17, line=35 (Interpreted frame)
         - javax.naming.spi.NamingManager.getURLObject(java.lang.String, java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=62, line=584 (Interpreted frame)
         - javax.naming.spi.NamingManager.processURL(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=31, line=364 (Interpreted frame)
         - javax.naming.spi.NamingManager.processURLAddrs(javax.naming.Reference, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=56, line=344 (Interpreted frame)
         - javax.naming.spi.NamingManager.getObjectInstance(java.lang.Object, javax.naming.Name, javax.naming.Context, java.util.Hashtable) @bci=124, line=316 (Interpreted frame)
         - com.sun.jndi.ldap.LdapReferralContext.<init>(com.sun.jndi.ldap.LdapReferralException, java.util.Hashtable, javax.naming.ldap.Control[], javax.naming.ldap.Control[], java.lang.String, boolean, int) @bci=212, line=93 (Interpreted frame)
         - com.sun.jndi.ldap.LdapReferralException.getReferralContext(java.util.Hashtable, javax.naming.ldap.Control[]) @bci=38, line=132 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtx.searchAux(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, boolean, boolean, com.sun.jndi.toolkit.ctx.Continuation) @bci=269, line=1838 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtx.c_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=14, line=1749 (Interpreted frame)
         - com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=72, line=368 (Interpreted frame)
         - com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=32, line=338 (Interpreted frame)
         - com.sun.jndi.ldap.LdapReferralContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=44, line=639 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtx.searchAux(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, boolean, boolean, com.sun.jndi.toolkit.ctx.Continuation) @bci=282, line=1844 (Interpreted frame)
         - com.sun.jndi.ldap.LdapCtx.c_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=14, line=1749 (Interpreted frame)
         - com.sun.jndi.toolkit.ctx.ComponentDirContext.p_search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls, com.sun.jndi.toolkit.ctx.Continuation) @bci=72, line=368 (Interpreted frame)
         - com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.search(javax.naming.Name, java.lang.String, javax.naming.directory.SearchControls) @bci=32, line
        ...

        THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

        THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

        STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
        We do not have a reduced test case that reproduces this problem. All occurrences involve our entire application stack.

        ERROR MESSAGES/STACK TRACES THAT OCCUR :
        None. JVM process exits due to signal 11 (SIGSEGV).

        REPRODUCIBILITY :
        This bug can be reproduced always.

        CUSTOMER SUBMITTED WORKAROUND :
        Set -XX:StackShadowPages=20
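
One way to confirm that the workaround is in effect on a running JVM is to read the flag back through the HotSpot diagnostic MXBean. This is a sketch under assumptions: the class name is hypothetical, it relies on the HotSpot-specific com.sun.management API being present, and it assumes the JVM in use exposes StackShadowPages through getVMOption.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class ShadowPagesCheck {
    /**
     * Returns the current StackShadowPages value as a string,
     * or null if it cannot be read on this JVM.
     */
    public static String currentValue() {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            return bean.getVMOption("StackShadowPages").getValue();
        } catch (Exception e) {
            // Flag not exposed, MXBean not registered, or I/O failure on the proxy.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("StackShadowPages = " + currentValue());
    }
}
```

Running this with and without -XX:StackShadowPages=20 on the command line should show whether the setting reached the JVM.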

              coleenp Coleen Phillimore
              webbuggrp Webbug Group