JDK-4844565

UseParallelGC problem with 1.4.1_01 and 1.4.1_02 on IA-32 "Foster" chips.


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: P2
    • None
    • 1.4.1_02
    • hotspot
    • gc
    • x86
    • windows_2000

      Two reports follow, both on the same problem (escalation with sustaining
      engineering might follow):


      REPORT #1
      ---------

      I would like to report a problem that we have been seeing with the Sun
      JVM 1.4.1_01 and 1.4.1_02.

      This problem can be reproduced by running the industry standard benchmark
      SpecJBB 2000 (version 1.02). The following command-line can be used:

      java -server -verbosegc -XX:NewSize=634m -XX:MaxNewSize=634m -Xms1600m
      -Xmx1600m -XX:+UseParallelGC -Xbatch -Xss128k -cp
      .\jbb.jar;.\jbb_no_precompile.jar;.\check.jar;.\reporter.jar;.
      spec.jbb.JBBmain -propfile SPECjbb.props

      There is nothing special about this command-line, other than turning
      on UseParallelGC. If UseParallelGC is not enabled the problem does not
      occur. If UseParallelGC is enabled then we have only observed this problem
      on our servers when populated with IA-32 1.6GHz Xeon MPs ("Foster") --
      i.e. if we use the same system, backplane et al., but replace the CPUs with
      anything other than Foster chips, we have not been able to reproduce
      the problem. When configured with IA-32 Foster CPUs, the problem occurs
      at arbitrary points (although always during GC, as verified with verbosegc).
      Sometimes the benchmark is successful without encountering the faults,
      so in order to reproduce the problem we have to run the test more than
      once or lengthen the duration of the benchmark.
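
      Because the failure is intermittent, the repeated runs described above
      can be automated. The following is only a minimal sketch of such a
      harness, not part of the original report: the class and method names,
      the 60-minute timeout and the limit of 10 runs are all illustrative
      assumptions. It is written for a modern JDK and assumes that the "java"
      launcher on the PATH is the 1.4.1 JVM under test and that the working
      directory is the SPECjbb2000 directory.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

public class RepeatUntilFailure {

    // Calls runOnce up to maxRuns times; returns the 1-based index of the
    // first run whose exit code is non-zero, or -1 if every run succeeded.
    static int firstFailingRun(IntSupplier runOnce, int maxRuns) {
        for (int run = 1; run <= maxRuns; run++) {
            if (runOnce.getAsInt() != 0) {
                return run;
            }
        }
        return -1;
    }

    // Launches the command and returns its exit code. If the process does not
    // finish within timeoutMinutes (a hang is one of the reported symptoms),
    // it is killed and -1 is returned.
    static int runCommand(List<String> command, long timeoutMinutes) {
        try {
            Process p = new ProcessBuilder(command).inheritIO().start();
            if (!p.waitFor(timeoutMinutes, TimeUnit.MINUTES)) {
                p.destroyForcibly();
                return -1;
            }
            return p.exitValue();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Command line taken from the report above.
        List<String> cmd = List.of(
            "java", "-server", "-verbosegc",
            "-XX:NewSize=634m", "-XX:MaxNewSize=634m",
            "-Xms1600m", "-Xmx1600m",
            "-XX:+UseParallelGC", "-Xbatch", "-Xss128k",
            "-cp", ".\\jbb.jar;.\\jbb_no_precompile.jar;.\\check.jar;.\\reporter.jar;.",
            "spec.jbb.JBBmain", "-propfile", "SPECjbb.props");

        // Hypothetical limits: 10 runs, 60 minutes per run.
        int failed = firstFailingRun(() -> runCommand(cmd, 60), 10);
        System.out.println(failed < 0
            ? "no failure in 10 runs"
            : "failure on run " + failed);
    }
}
```

      A non-zero exit code or a timeout only indicates that something went
      wrong; the verbosegc output of the failing run would still need to be
      inspected to confirm that the fault occurred during GC.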

      We have encountered a number of symptoms -- e.g. here are a couple of stack
      traces at the point of failure:

      MarkSweep::follow_stack() line 92 + 9 bytes
      PSMarkSweep::mark_sweep_phase1(int & 0, int 0) line 255
      PSMarkSweep::invoke_at_safepoint(int 0, int & 0) line 89 + 26 bytes
      PSScavenge::invoke_at_safepoint(unsigned int 0, int 1, int & 0) line 383 +
      11 bytes
      ParallelScavengeHeap::collect_at_safepoint(ParallelScavengeHeap * const
      0x004f005f, ParallelScavengeHeap::CollectionType MarkSweep, unsigned int 0,
      int & 0) line 244 + 25 bytes
      VM_ParallelScavengeGCCollect::doit(VM_ParallelScavengeGCCollect * const
      0x004f005f) line 125
      VM_Operation::evaluate(VM_Operation * const 0x004f005f) line 30
      VMThread::evaluate_operation(VMThread * const 0x004f005f, VM_Operation *
      0x6d630aab) line 258
      VMThread::loop(VMThread * const 0x004f005f) line 334
      VMThread::run(VMThread * const 0x004f005f) line 186
      _start(Thread * 0x00000000) line 286


      instanceKlass::oop_follow_contents(instanceKlass * const 0x2b7eb748, oopDesc
      * 0x6d598ec0) line 986
      MarkSweep::mark_and_follow(oopDesc * * 0x2b7eb708) line 58 + 13 bytes
      objArrayKlass::oop_follow_contents(objArrayKlass * const 0x2b7eb748, oopDesc
      * 0x2b7eb6f8) line 211 + 6 bytes
      MarkSweep::follow_stack() line 94 + 12 bytes
      PSMarkSweep::mark_sweep_phase1(int & 1, int 710017024) line 254
      09e80000()
      66d38778()

      In addition, we have seen the problem apparently manifest itself as the JVM
      becoming "stuck", repeatedly entering GC cycles while the benchmark throws
      NullPointerExceptions.

      It appears that as the number of CPUs used (and the number of GC
      threads) increases, the chance of the problem appearing also increases;
      e.g. a 32x configuration appears to be more susceptible to the problem
      than a 16x, which in turn is more susceptible than an 8x.
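
      One way to check whether the failure rate follows the GC thread count
      rather than the CPU count is to pin the number of parallel GC threads
      on a single machine. The sketch below is not from the original report.
      It assumes that the HotSpot option -XX:ParallelGCThreads is honoured by
      the JVM build under test; the class and method names are hypothetical,
      and the base command line is abbreviated from the one given earlier.

```java
import java.util.ArrayList;
import java.util.List;

public class GcThreadSweep {

    // Returns a copy of baseArgs with -XX:ParallelGCThreads=<gcThreads>
    // inserted directly after the launcher name (element 0).
    static List<String> withGcThreads(List<String> baseArgs, int gcThreads) {
        List<String> args = new ArrayList<>(baseArgs);
        args.add(1, "-XX:ParallelGCThreads=" + gcThreads);
        return args;
    }

    public static void main(String[] args) {
        // Abbreviated base command; heap and classpath options omitted here.
        List<String> base = List.of(
            "java", "-server", "-verbosegc", "-XX:+UseParallelGC",
            "spec.jbb.JBBmain", "-propfile", "SPECjbb.props");

        // Thread counts mirror the 8x, 16x and 32x configurations mentioned above.
        for (int n : new int[] {8, 16, 32}) {
            System.out.println(String.join(" ", withGcThreads(base, n)));
        }
    }
}
```

      If the failure rate on a 32x system drops when the GC thread count is
      pinned to 8, that would point at the parallel collector's thread
      interaction rather than at the CPU count as such.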

      I don't know how difficult it might be for you to reproduce or recognize
      the problem. It is strange that we have only encountered it to date with
      the Foster chips.

      I'm guessing that some heap state has been corrupted, which later leads to
      the faults, but I am not sure how to investigate further in order to give
      you more information.


      REPORT #2
      ---------

      The problem looks similar to bug 4827353 ("atomic::membar doesn't on x86")
      for 1.4.2, which has been fixed. The patches were to the following two
      files:

        \hotspot\src\cpu\i486\vm\assembler_i486.cpp
        \hotspot\src\os_cpu\win32_i486\vm\atomic_win32_i486.inline.hpp

      I applied these changes both to the JVM port we have for our system
      and to the base Sun JVM, version 1.4.1_01, which is the base for
      our current JVM.

      Having tested these patches and observed the same failure with the
      Sun JVM as before, it appears that this problem reported in 1.4.1_01 and
      1.4.1_02 is not fixed by these patches.

      Is it possible that these fixes that were added to 1.4.2 are dependent
      on other changes made earlier in that stream, or that I overlooked
      additional changes specifically related to this bug fix?

      The problem has only appeared when using 1.6GHz "Foster" IA-32 chips in a
      multiprocessor system (8x, 16x, 32x, etc.). Replace the Intel chips
      with faster or slower ones and we have no signs of trouble. It is still a
      big mystery...

            pbk Peter Kessler
            clucasius Carlos Lucasius (Inactive)