JDK-6413516 : 4 unusual very long ParNew GC


    • Type: Bug
    • Resolution: Duplicate
    • Priority: P4
    • Fix Version/s: None
    • Affects Version/s: 1.4.2
    • Component: hotspot
    • Subcomponent: gc
    • Introduced In Version: 1.4.2
    • CPU: sparc
    • OS: solaris_9

      FULL PRODUCT VERSION :
      java version "1.4.2_09"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_09-b05)
      Java HotSpot(TM) 64-Bit Server VM (build 1.4.2_09-b05, mixed mode)

      FULL OS VERSION :
      SunOS mnbas-prop-01 5.9 Generic_117171-05 sun4u sparc SUNW,Netra-T12

      EXTRA RELEVANT SYSTEM CONFIGURATION :
      64GB RAM

      A DESCRIPTION OF THE PROBLEM :
      On 24/03/2006 we ran a stress test of our server application together with a customer.
      About 100 clients were connected during this test.
      We experienced 4 unusually long ParNew GCs:
      5360.712/14:35:50 : duration 427s (7min7s)
      6229.767/14:50:20 : duration 426s (7min8s)
      6658.591/14:57:29 : duration 453s (7min33s)
      7112.431/15:05:03 : duration 479s (7min59s)
      The last 3 GCs occurred consecutively, with almost no application time in between.
      The server had been running for only 1h30 when these long GCs appeared.
      We had to restart our application because it appeared "frozen" for 22min40s.
      According to the GC logs, only 13GB (out of 32GB) of heap were in use.

      Here are the JVM options:
      java -d64 -XX:+DisableExplicitGC -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:-CMSParallelRemarkEnabled -XX:PermSize=64m -XX:CMSInitiatingOccupancyFraction=30 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:xxx

      --- GC log extracts
      5360.712: [GC {Heap before GC invocations=3514:
      Heap
      par new generation total 53184K, used 53004K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 99% used [0xfffffff6b2400000, 0xfffffff6b57c3338, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      to space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      concurrent mark-sweep generation total 33738880K, used 13533825K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23016K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      5360.713: [ParNew: 53004K->0K(53184K), 427.4508010 secs] 13586830K->13537652K(33792064K) Heap after GC invocations=3515:
      Heap
      par new generation total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      to space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      concurrent mark-sweep generation total 33738880K, used 13537652K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23016K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      } , 427.4515902 secs]
        Total time for which application threads were stopped: 427.4592638 seconds
      ---
      6229.767: [GC {Heap before GC invocations=3516:
      Heap
      par new generation total 53184K, used 53120K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 100% used [0xfffffff6b2400000, 0xfffffff6b57e0000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      to space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      concurrent mark-sweep generation total 33738880K, used 13547239K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23021K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      6229.767: [ParNew: 53120K->0K(53184K), 426.6925385 secs] 13600359K->13556730K(33792064K) Heap after GC invocations=3517:
      Heap
      par new generation total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      to space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      concurrent mark-sweep generation total 33738880K, used 13556730K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23021K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      } , 426.6934510 secs]
        Total time for which application threads were stopped: 426.7247834 seconds
      ---
      6658.591: [GC {Heap before GC invocations=3517:
      Heap
      par new generation total 53184K, used 53120K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 100% used [0xfffffff6b2400000, 0xfffffff6b57e0000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      to space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      concurrent mark-sweep generation total 33738880K, used 13556730K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      6658.592: [ParNew: 53120K->0K(53184K), 453.2627707 secs] 13609850K->13564527K(33792064K) Heap after GC invocations=3518:
      Heap
      par new generation total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      to space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      concurrent mark-sweep generation total 33738880K, used 13564527K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      } , 453.2636434 secs]
        Total time for which application threads were stopped: 453.4103892 seconds
      ---
      7112.431: [GC {Heap before GC invocations=3518:
      Heap
      par new generation total 53184K, used 53120K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 100% used [0xfffffff6b2400000, 0xfffffff6b57e0000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      to space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      concurrent mark-sweep generation total 33738880K, used 13564527K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      7112.432: [ParNew: 53120K->0K(53184K), 479.2269278 secs] 13617647K->13573097K(33792064K) Heap after GC invocations=3519:
      Heap
      par new generation total 53184K, used 0K [0xfffffff6b2400000, 0xfffffff6b5800000, 0xfffffff6b5800000)
      eden space 53120K, 0% used [0xfffffff6b2400000, 0xfffffff6b2400000, 0xfffffff6b57e0000)
      from space 64K, 0% used [0xfffffff6b57f0000, 0xfffffff6b57f0000, 0xfffffff6b5800000)
      to space 64K, 0% used [0xfffffff6b57e0000, 0xfffffff6b57e0000, 0xfffffff6b57f0000)
      concurrent mark-sweep generation total 33738880K, used 13573097K [0xfffffff6b5800000, 0xfffffffec0c20000, 0xffffffff32400000)
      concurrent-mark-sweep perm gen total 65536K, used 23025K [0xffffffff32400000, 0xffffffff36400000, 0xffffffff36400000)
      } , 479.2278830 secs]
        Total time for which application threads were stopped: 479.3643864 seconds
      The full GC logs are available at this address:
      http://www.mycom-int.fr/mdds_gc_20060324130625.log.zip



      Here are some mails exchanged with a Sun engineer:
      ------- Forwarded message -------
        From: "Jon Masamitsu" <###@###.###>
        To: "David Tavoularis" <###@###.###>
        Subject: Re: GC Feedback
      Date: Thu, 30 Mar 2006 17:49:46 +0300

      David,

      Do you have a testcase that demonstrates this problem that you can send to us?

      If not, just a few questions.

      Does this problem occur every time you run your applications?

      Is the problem occurring at about the same point in each run
      (i.e., after about the same amount of run time, or at a point where the heap is about the same in terms of maximum size and amount being used)?

      Does the application change behavior at a particular point (e.g., start allocating significantly more data or start allocating very large objects).
      You mention 4 long GC's.
      I noted 5, and that once the long GC's started they continued (no short GC's after).
      I'm looking for something that has changed that would lead to the long GC's.

      How many cpus are on your test machine?

      Is there anything else running on your test machine when you see this bad behavior?

      If you have a second log that shows the problem, we might be able to get some information from it by comparing the two.

      Jon

      ------- Forwarded message -------
        From: "Philippe Bonneau" <###@###.###>
        To: "David Tavoularis" <###@###.###>, ###@###.###
        Subject: Re: GC Feedback
      Date: Thu, 30 Mar 2006 19:34:35 +0300

      Hello Jon,

      Just in order to explain why I will answer "no" to many of your questions, here is a little context:
      we have been struggling for several months to implement in our application a special mechanism that self-monitors the tenured occupancy of our server and automatically slows down its activity when the tenured generation fills up (in order to wait for the current CMS cycle to complete), so that we can guarantee that we will NEVER undergo a stop-the-world collection.
      In the meantime, our customer had to live with several daily 20-minute stop-the-world collections.
      Now that we have at last installed our long-expected "miracle solution" (which mostly works), they ran a stress test session to validate it, and they experienced the problem described below with the New generation!
      Even though we tried to explain that the cause is very different, from the end user's point of view all they see is the system freezing again, and some of them question whether we fixed anything at all.
      They encountered the problem only once, but needless to say, in this context the customer is certainly not keen on helping us reproduce it. For information, they mobilized 100 people at the same time to conduct the test, so it is not something we can ask them to re-do on demand.
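
      For illustration, a minimal sketch of this kind of tenured-occupancy watchdog. It is not the reporter's actual code: it assumes the java.lang.management API introduced in JDK 5.0 (the deployment above runs on 1.4.2), and the pool-name check and the 70% threshold are assumptions.

      import java.lang.management.ManagementFactory;
      import java.lang.management.MemoryPoolMXBean;
      import java.lang.management.MemoryUsage;

      // Illustrative sketch only: periodically check tenured (old gen) occupancy and
      // apply application-level back-pressure when it gets too full, so that the
      // concurrent CMS cycle can finish before the generation fills up.
      public class TenuredOccupancyWatchdog implements Runnable {
          private static final double SLOWDOWN_THRESHOLD = 0.70; // hypothetical value

          public void run() {
              for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                  // With -XX:+UseConcMarkSweepGC the tenured pool is named "CMS Old Gen".
                  if (pool.getName().indexOf("Old Gen") < 0) {
                      continue;
                  }
                  MemoryUsage usage = pool.getUsage();
                  double occupancy = (double) usage.getUsed() / usage.getMax();
                  if (occupancy > SLOWDOWN_THRESHOLD) {
                      throttleNewWork();   // slow down until CMS has reclaimed space
                  } else {
                      resumeNormalWork();
                  }
              }
          }

          private void throttleNewWork()  { /* application-specific back-pressure */ }
          private void resumeNormalWork() { /* application-specific */ }
      }

      Such a runnable would typically be scheduled every few seconds from a background thread; the real mechanism is of course application-specific.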

      That said, please find my answers below; many thanks in advance for your help.

      > Do you have a testcase that demonstrates this problem that you can send to us?
      No, in fact we were hoping that you could give us some leads to write one that would reproduce the symptoms.

      > Does this problem occur every time you run your applications?
      Happened only once, under high stress; but given the context we are very
      reluctant to tell our customer not to worry and that the problem is very rare.

      > Is the problem occurring at about the same point in each run (i.e., after about the same amount of run time, or at a point where the heap is about the same in terms of maximum size and amount being used)?
      It happened only once, but the system has successfully gone past this amount of RAM usage many times.

      > Does the application change behavior at a particular point (e.g., start allocating significantly more data or start allocating very large objects).
      This is quite possible and is our primary suspicion, but our logs do not report every single click of every single user, so we do not have any specific information.
      The sequence of long GCs is easily explained, as the system is under high activity (as you can see above, New GCs are triggered several times per *second*).
      Conversely, can you let us know if, for example, repeated allocation requests for huge objects could cause such symptoms?

      > How many cpus are on your test machine?
      It has 16 CPUs. This is unfortunately a live machine! The problem never occurred on a test server.

      > Is there anything else running on your test machine when you see this bad behavior?
      Yes, other processes belonging to our application. They are not very suspicious, however, because as mentioned above the problem occurred during a stress test of the process concerned, so we would really be out of luck if the problem were caused by a side effect of another process (these processes usually run 24 hours a day without a glitch).

      > If you have a second log that shows the problem, we might be able to get some information from it by comparing the two.
      We don't.

      ------- Forwarded message -------
        From: "Jon Masamitsu" <###@###.###>
        To: "Philippe Bonneau" <###@###.###>
        Subject: Re: GC Feedback
      Date: Thu, 30 Mar 2006 20:30:39 +0300

      Philippe,

      Usually when I see such a large change in the ParNew collection times
      I think of swapping. It's hard to positively identify swapping as
      a problem unless you're looking for it. Any chance the customer
      was monitoring the swapping activity on his system when the long
      GC's occurred? If you want to see the effect of swapping, you
      can run a test system with a heap that is larger than physical memory
      (with the size of the heap less than the amount of swap space,
      of course).
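
      For illustration only (the sizes below are assumptions, not from this mail): on a test box with, say, 8GB of physical RAM and 16GB of swap, a run along these lines would push part of the heap onto swap and make young collections dramatically slower. "StressTestMain" stands in for whatever test driver is used.

      java -d64 -Xms12288m -Xmx12288m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
           -XX:+PrintGCDetails -XX:+PrintGCTimeStamps StressTestMain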

      During the development cycle for JDK 6.0 we identified a problem
      with ParNew collections of very large objects that exhibited itself
      as long pauses. It is fixed in 6.0. I believe the fix was also
      backported to 5.0. If you are running into this problem, it goes
      away if you use -XX:+UseConcMarkSweepGC with -XX:-UseParNewGC.

      You mention that the customer was seeing 20 minute stop-the-world
      collections occasionally when using CMS. Do you happen to know
      how long the tenured generation collections would be if the
      parallel collector (-XX:+UseParallelGC) were used? I'm just
      wondering if the 20 minute collection time is way out of bounds
      for a CMS stop-the-world collection.

      Lastly, you guys sound like you definitely know how to tune CMS.
      May I ask how you learned to use our collectors?

      Jon


      ------- Forwarded message -------
        From: "Jon Masamitsu" <###@###.###>
        To: "Philippe Bonneau" <###@###.###>
        Subject: Re: GC Feedback
      Date: Sat, 01 Apr 2006 04:51:35 +0300

      David and Philippe,

      I haven't figured out anything that would explain this problem.
      I've sent your mail around to some other members of the group and
      we'll see what we can come up with.

      The performance problem with large objects can be seen with a program
      that allocates a couple of 120m objects that stay alive through the
      life of the program and then allocates smaller objects to force GC's.
      Once the large objects get promoted to the old gen, the ParNew
      collections slow down. You see the effect without CMS (just add
      -XX:+UseParNewGC). There will be a big difference between the young
      generation collection times when compared with the default collector.
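
      A minimal sketch of the kind of test program described above; the exact sizes and loop count are assumptions, not an official reproducer.

      // Keep two ~120 MB arrays live for the whole run so they are promoted to the
      // old generation, then churn small short-lived objects to keep forcing young GCs.
      public class LargeObjectParNewTest {
          public static void main(String[] args) {
              byte[][] big = new byte[2][];
              for (int i = 0; i < big.length; i++) {
                  big[i] = new byte[120 * 1024 * 1024];   // ~120 MB each, stays live
              }

              byte[] churn = null;
              for (long j = 0; j < 50000000L; j++) {
                  churn = new byte[64 * 1024];            // short-lived garbage
              }

              // Touch the arrays so they cannot be optimized away.
              System.out.println(big[0][0] + " " + big[1][0] + " " + churn.length);
          }
      }

      Running this with a heap large enough for the two arrays (for example -Xmx512m), once with -XX:+UseParNewGC -verbose:gc and once with the default collector, should show the difference in young-generation collection times mentioned above.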

      In the meantime, how did you decide on the size for the
      young generation? It seems a little small (52m) compared to
      the size of the old generation. At the end of the GC log it looked like
      the amount of live data in the old generation was about
      10g. Is that the size of the steady-state working set?
      By default CMS uses small survivor spaces. Something
      like -XX:SurvivorRatio=6 and -XX:MaxTenuringThreshold=15
      might help slow down promotions into the old generation.
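
      For illustration, the original command line from the report with the two suggested survivor-space options added (everything else unchanged; whether these particular values help is exactly what is being suggested to try):

      java -d64 -XX:+DisableExplicitGC -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
           -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=15 \
           -XX:-CMSParallelRemarkEnabled -XX:PermSize=64m -XX:CMSInitiatingOccupancyFraction=30 \
           -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution \
           -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:xxx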

      Jon

      ------- Forwarded message -------
        From: "Jon Masamitsu" <###@###.###>
        To: "Philippe Bonneau" <###@###.###>
        Subject: Re: GC Feedback
      Date: Tue, 04 Apr 2006 01:57:14 +0300

      Philippe and David,

      We have not come up with any good explanation for the behavior you
      are seeing. I would suggest that you file a bug.

      This may well come back to us. If it does, we'll assign
      it to an engineer who can put some dedicated time in on
      it. It's like nothing we've seen, though, so it will likely
      be hard to figure out.

      Jon

      THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try

      THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Yes

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      No long GCs when there is plenty of available memory.
      REPRODUCIBILITY :
      This bug can be reproduced rarely.

            Assignee: jmasa (Jon Masamitsu, Inactive)
            Reporter: tbell (Tim Bell)