JDK-4472895

Zeroing out instructions that threads are currently executing causes VM crash


    • Fix version: beta2
    • CPU: generic, sparc
    • OS: solaris_7, solaris_8

      VM crash bug reported by one of the CAP members. Their tests are not
      very portable, and it would take a lot of work to set them up and run them.
      The error message (hs_err_pid26796.log) and core files were attached instead.
      They are willing to work with someone on the HotSpot team to narrow it down
      and produce a test case outside their product code if possible.

      ----------------------------------------------------------------------------------
      J2SE Version (please include all output from java -version flag):

      java version "1.4.0-beta"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.0-beta-b65)
      Java HotSpot(TM) Client VM (build 1.4.0-beta-b65, mixed mode)

      Does this problem occur on J2SE 1.3? Yes / No (pick one)

      No

      Operating System Configuration Information (be specific):

      Solaris 7 with the latest patches.

      Hardware Configuration Information (be specific):

      This was run on a Sun 420R quad processor with 1 GB RAM

      Bug Description:
      When using more than one thread in a section of code it is possible
      (it seems) for HotSpot to zero out the instructions which one of the
      two (or more) threads is currently executing (or will execute in the
      near future). This appears to happen after executing the code several
      hundred times, as though HotSpot is coming back and reoptimizing the
      code segment by zeroing the old instructions out and then letting another
      thread kill the JVM because the SPARC processor cannot execute the
      instruction 0x0.
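
      A minimal sketch of the kind of standalone reproducer being suggested
      (hypothetical code, not taken from the customer's product; the class and
      method names are made up): several threads repeatedly call the same hot
      method, so that HotSpot compiles and recompiles it while other threads are
      still executing it.

          // Hypothetical stress harness: several threads hammer one method so the
          // HotSpot compiler recompiles it while other threads are executing it.
          public class HotMethodStress {
              // A method called often enough to become "hot" and be compiled.
              static int hotMethod(int seed) {
                  int acc = seed;
                  for (int i = 0; i < 1000; i++) {
                      acc = acc * 31 + i;
                  }
                  return acc;
              }

              public static void main(String[] args) throws InterruptedException {
                  int count = (args.length > 0) ? Integer.parseInt(args[0]) : 2;
                  Thread[] workers = new Thread[count];
                  for (int t = 0; t < count; t++) {
                      final int id = t;
                      workers[t] = new Thread(new Runnable() {
                          public void run() {
                              int result = 0;
                              // Far more iterations than the few hundred calls after
                              // which the crash was observed.
                              for (int i = 0; i < 500000; i++) {
                                  result += hotMethod(id + i);
                              }
                              System.out.println("thread " + id + " done: " + result);
                          }
                      });
                      workers[t].start();
                  }
                  for (int t = 0; t < count; t++) {
                      workers[t].join();
                  }
              }
          }

      Run with, for example, "java -server HotMethodStress 4" to exercise four
      threads against the same compiled method.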


      Detail problem description from customer:
      +++++++++++++++++++++++++++++++++++++++++

      JVM was configured with:
      Tests 1-7: -server -Xms128m -Xmx512m
      Test 8:    -server -Xms512m -Xmx512m
      Test 9:    -server -Xms512m -Xmx512m
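
      For reference, a complete launch command combining these options would look
      roughly like the following (the main class and its arguments are placeholders,
      not the customer's actual command line):

          java -server -Xms128m -Xmx512m <MainClass> <application args...>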

      Tests:
      Test 1:
      Running any test against MapXtreme Java causes the JVM to crash.
      Every crash is caused specifically by Solaris delivering a SIGILL
      to one of the threads running inside of the JVM. The JVM traps
      the SIGILL (signal 4) and prints an error report to the console,
      and then calls abort() to create a core file. Note that abort()
      kills its calling process by raising SIGABRT (signal 6) against itself.

      This "test" was actually run many times with differing user loads.
      The number of virtual users ranged from 50 down to 1, always running
      in stress mode (no think time). In all tests involving more than 10
      users, 10 users were started at test start, and the number of users
      was ramped up at a rate of 10 users per minute.

      Crashes produced during this test always had the same core image
      file "appearance". Many threads were active at once (say 50 threads),
      but only one had received the SIGILL which caused the process to shutdown.
      Additionally, the call stack for that thread had 24 HotSpot-created
      functions on it. (HotSpot-created functions do not have entries in the
      symbol tables and as a result appear as "??" in GDB.)

      Oddly enough, all memory around the instruction which caused the
      SIGILL (and the instruction itself) is zeroed out in the core file.
      Is this a "feature" of the core dump facility on Solaris, a bug
      in GDB, or really what happened to the JVM? (i.e. did HotSpot or
      the GC zero out memory it wasn't supposed to zero out?) I don't
      think this is a GDB bug, as other core files created due to SIGSEGV
      seem to have legal SPARC instructions at the reported PC.
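
      A sketch of the kind of gdb session used to make that check (the paths are
      placeholders for the actual java binary and core file):

          gdb <path-to-java-binary> <core-file>
          (gdb) info threads        # list the threads captured in the core
          (gdb) thread <n>          # switch to the thread that received the SIGILL
          (gdb) bt                  # backtrace shows the "??" HotSpot-created frames
          (gdb) x/8i $pc            # disassemble around the faulting PC
          (gdb) x/16xw $pc-32       # dump raw words around the PC (reported all zero here)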

      Test 2:
      What effect does single-threading the server have, by reconfiguring
      Silk to issue only one request at a time per user? If the crash is
      thread-related, we should not see a crash in the single-threaded case.

      One virtual user was configured to issue only one request at a time,
      but with no think time between requests.

      Result: 3,559 requests in 28 minutes. No errors.

      Test 3:
      If test #2 holds true and single threading works, then running only
      two threads may be enough to reproduce the crash.

      One virtual user was configured to issue two concurrent requests
      at a time, with no think time between requests.

      Result: crash around 5,000 requests after 10 minutes of load. The crash
      appears to be the same as observed in test #1 (SIGILL delivered
      with 24 stack frames of HotSpot-created functions).

      Test 4:
      According to the help for the "java" command, -Xbatch should prevent
      HotSpot from replacing the code of a method at runtime. Since the
      crash occurs only after processing several hundred requests successfully,
      the crash must be caused by either replacement code generated by
      HotSpot in the middle of the test or a periodic background cleanup
      task firing. Since the single-user case held, I'm leaning
      towards the former case. I'm also thinking that perhaps HotSpot is
      changing the code for a method while another thread is attempting to
      execute it, and the crash is because the executing thread is seeing
      the machine code in the middle of the change (an unstable state).

      Result: Using -Xbatch just causes the JVM to suffer from many
      NoClassDefFoundError exceptions in com.ibm.xml.parser.Token.getName.
      A very odd error. I can only conclude that -Xbatch does not work in
      this version of the JVM. This test was run many times with the
      same result. The good news is that the JVM does not SIGILL when using
      -Xbatch, but I don't think it gets far enough to reach the point where
      the SIGILL would occur.

      Additionally, the stack trace created from the NoClassDefFoundError
      is 34 methods deep. Unless HotSpot was able to inline 10 methods to
      reduce the stack depth, this NoClassDefFoundError cannot be the problem
      we are seeing in the other 3 tests. On top of this, absolutely no
      request completes successfully with -Xbatch enabled, so this is really
      not of any help.

      Test 5:
      I split the client servlet (SimpleMapTestPlus) into its own JVM,
      separate from the MapXtreme Java server servlet. Both ran on the same
      machine underneath the same Apache server.

      The result was the same as all other tests (except test #4). The
      server JVM process died with a SIGILL and its stack trace (as reported
      by GDB) shows 24 HotSpot functions on the call stack.

      Test 6:
      Paul Jossman suggested using MapXtreme Java 4.0 build 20 as the XML
      parser has been switched away from IBM's XML parser to Apache's Xerces
      parser. This test was run with a single virtual user allowed to
      make 4 concurrent connections to MapXtreme Java. Both client and
      server servlets were in the same JVM, and no think time was allowed
      between requests.

      The JVM made it through about 4,000 requests before it died. Its
      death yielded the same SIGILL and 24 frame stack trace as every
      other crash.

      Test 7:
      MapXtreme Java 4.0 build 20 was tested with -Xbatch enabled to see
      if this cleared up the NoClassDefFoundError seen in Test 4.

      After 186 successful requests, the JVM died with a SIGSEGV (signal 11).
      At least with the Xerces XML parser we do not see the NoClassDefFoundError.
      Examining the core file in gdb reveals that the thread which caught
      the signal was executing in a JVM-internal method:

      int PhaseChaitin::stretch_base_pointer_live_ranges(ResourceArea*)

      Some of the classes calling this method seemed to refer to the runtime
      HotSpot compiler. Perhaps the reason for the crash is an invalid
      pointer dereference in the HotSpot compiler itself.
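
      A sketch of the gdb commands used to attribute the faulting frame to a named
      JVM-internal function rather than to HotSpot-generated code (the binary and
      core file paths are placeholders):

          gdb <path-to-java-binary> <core-file>
          (gdb) bt                  # named JVM frames here instead of "??" entries
          (gdb) info symbol $pc     # maps the faulting PC to the containing function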

      Test 8:
      Under the assumption that test 7's error was a result of trying to
      increase the size of the heap during runtime, I resized the initial
      heap to be the same as the maximum. However, since the call stack
      contained references to the "Compiler" object, I doubt this is the case.

      The test ran successfully for an hour, completing 18,278 image requests
      for one virtual user, no think time and 4 concurrent connections.
      Twelve hours after the test completed (around 4:20 am), when there was no
      load on the server, the JVM SIGSEGV'd at random.

      Conclusions:
      It would seem as though this particular version of the JVM has some
      errors in its HotSpot code generator (the runtime compiler). On Solaris,
      after a period of time we see the 32-bit JVM crash with a SIGILL
      having been delivered to the process. The SIGILL is always issued at
      the same point in our software: 24 stack frames down on the runtime
      stack. Each of these entries is a dynamically created method, with only
      the HotSpot error-trapping code on one end of the stack and the
      JVM thread root functions on the other. There appear to be 9 functions
      associated with the Tomcat call stack, leaving the last 15 functions
      to be (possibly) ones from MapXtreme Java's server servlet.

      It would seem as though the crash occurs after handling about 533
      requests. Typically the easiest way to cause the crash is to run
      MapXtreme Java with 1 user stress test loading all 13 images in
      the MapXtreme Java 3.1 test Shawn B. created. In this test, Silk
      is running 4 concurrent connections per user to the server.

      Perhaps the issue with multiple threads is that HotSpot has recreated
      the instructions for the method, but has somehow screwed up in copying
      the new instructions into the method's storage in memory. As a result,
      some other thread calls into the new method before the new method is
      truly ready for execution, and tries to execute a partially complete
      instruction, or something which was not an instruction (but rather older
      data lying in memory). Is the invalid instruction a null word or
      something silly like that?

      It would seem as though load has no bearing on when this crash will
      occur, as even one user (with no think time) can bring the JVM down with
      this error. I now have 3 core dumps showing identical stack traces from
      a thread dying with this SIGILL. Unfortunately, since HotSpot
      generates the code on the fly, there is no symbol table associated
      with the stack frames to uncover what method of MapXtreme Java is
      causing the error. If we can identify the current method of the thread
      that received the SIGILL signal, perhaps we can give Sun a test case
      which can reproduce the error.
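
      One way to get closer to identifying that method (a suggestion, not something
      tried in the tests above) is to log HotSpot's compilation activity and
      correlate the most recently compiled methods with the time of the crash,
      for example:

          java -server -Xms128m -Xmx512m -XX:+PrintCompilation <MainClass> > compile.log 2>&1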

      With newer releases of the 1.4 JVM, we have to wonder if this
      particular HotSpot bug has been fixed, or if the bug is still present
      but other changes to HotSpot's runtime compiler will cause the number
      of stack frames seen and their alignment in memory to be shifted such
      that it doesn't appear to be the same error.

      All fingers point to MapXtreme Java as the software causing HotSpot
      to generate illegal machine code, as the NullServlet test with Tomcat
      did not suffer from this runtime problem. Since we can run several
      hundred successful requests through the server before the crash, I am
      led to believe that either HotSpot recompiles the victim method
      improperly as a performance improvement, or that the victim method
      really is just called infrequently by MapXtreme Java (and, as it
      happens, is only called once every few hundred requests, for example as
      part of a background cleanup process).

      Update:

      After examining most of the core files from the tests, it would appear
      as though the memory has been zeroed around the instruction which is
      causing the illegal instruction. I examined the stacks of every
      active thread visible in the core files, and it would appear as though
      the memory was zeroed while the victim thread was sleeping. When it
      woke up it died with a SIGILL. It is not known when the memory zeroing
      occurred - it may have occurred while the thread was waiting for an IO,
      system, or AWT call to complete, and then it called a HotSpot'd
      method which had been zeroed over. Or it was preempted, and while it was
      waiting to be scheduled, one or more methods were zeroed behind its back.
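
      The per-thread survey described above can be reproduced in gdb with standard
      commands (the binary and core file names are placeholders):

          gdb <path-to-java-binary> <core-file>
          (gdb) thread apply all bt     # backtrace of every thread captured in the core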

      This really looks like a race condition, and the method being zeroed
      looks like it is also a MapXtreme Java server-side method.
