JDK-6339175

Periodic unaccountable JVM pauses


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: P2
    • 1.4.2_13
    • solaris_8
    • hotspot
    • None
    • sparc
    • solaris_8

      Operating System: Sparc Solaris
      OS version: 5.8
      Product Name: Java
      Product version: 1.4.2_09
      Hardware platform: Both Sun V240 and 480R
      Severity: 1

      Short description: Periodic unaccountable JVM pauses

      Full problem description:
      We have been experiencing periodic JVM pauses lasting for a few
      seconds after a few hours of load testing in a crucial customer
      application upgrade in their lab environment.
      We are currently running on Java 1.4.2_09 and cannot move to
      Java 1.5 due to other Java CORBA-related issues, so upgrading
      the JVM to the 1.5 stream is not an option.
       
      The application is running with the following JVM options:
       -server
       -XX:+UseConcMarkSweepGC
       -XX:+UseParNewGC
       -Xms256M
       -Xmx256M
       -Dsun.rmi.dgc.server.gcInterval=86400000
       -Dsun.rmi.dgc.client.gcInterval=86400000
       -XX:+PrintGCDetails
       -XX:+PrintGCTimeStamps
       -Xloggc://opt/redknee/product/s2100/current/log/gcstats.log
       
      We have analyzed the GC logs and all the logs reported sub-second
      collection times.

      We have periodically captured pstack/gcore analyses and thread
      dumps of the process in an attempt to deduce what the JVM is doing
      during the paused periods.

      These dumps were taken by an external application that monitors our
      application event record logs, looking for pauses in excess of
      three seconds, which should not occur under our load testing.
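
The gap check described above can be sketched as follows. This is only an illustration, not the reporter's monitoring tool: the class name `PauseGapFinder`, the method `findGaps`, and the sample timestamps are hypothetical, and event record timestamps are assumed to be available as ascending millisecond values. The sketch uses generics for brevity, which a 1.4.2 JVM would not accept.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags gaps between consecutive event record timestamps.
class PauseGapFinder {
    // Threshold taken from the report: gaps longer than three seconds are suspicious.
    static final long THRESHOLD_MS = 3000L;

    // Returns each index i where timestampsMs[i] - timestampsMs[i - 1] exceeds thresholdMs.
    // Timestamps are assumed to be in milliseconds and in ascending order.
    public static List<Integer> findGaps(long[] timestampsMs, long thresholdMs) {
        List<Integer> gaps = new ArrayList<Integer>();
        for (int i = 1; i < timestampsMs.length; i++) {
            if (timestampsMs[i] - timestampsMs[i - 1] > thresholdMs) {
                gaps.add(Integer.valueOf(i));
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Made-up sample data: only the jump from 1200 ms to 5000 ms exceeds the threshold.
        long[] sample = {0L, 500L, 1200L, 5000L, 5400L};
        System.out.println(findGaps(sample, THRESHOLD_MS)); // prints [3]
    }
}
```

A gap of exactly three seconds is not reported, since the report speaks of pauses in excess of three seconds.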

      We suspect the file readers reading from an NFS mount might be
      at fault. If an NFS mount becomes temporarily inaccessible, we
      speculate this might lock up the entire JVM and prevent the
      application threads from running.
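
One way to test this speculation is a heartbeat thread that performs no file I/O: if it also stalls while the NFS mount is unreachable, the pause is JVM-wide rather than limited to the reader threads. The sketch below is an assumption-laden illustration, not something taken from the report: `StallDetector`, `overshootMs`, and `maxOvershootMs` are hypothetical names, and `System.nanoTime()` requires Java 5 or later, so on 1.4.2 `System.currentTimeMillis()` would have to stand in for it.

```java
// Hypothetical heartbeat: measures how far each sleep overshoots its requested interval.
class StallDetector {
    // Overshoot of a sleep beyond its requested interval, clamped at zero.
    public static long overshootMs(long requestedMs, long actualMs) {
        return Math.max(0L, actualMs - requestedMs);
    }

    // Sleeps intervalMs repeatedly and returns the largest overshoot observed.
    // Stops early if the thread is interrupted.
    public static long maxOvershootMs(long intervalMs, int beats) {
        long worst = 0L;
        for (int i = 0; i < beats; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                break;
            }
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            worst = Math.max(worst, overshootMs(intervalMs, elapsedMs));
        }
        return worst;
    }

    public static void main(String[] args) {
        // 20 beats of 100 ms; an overshoot of several seconds would point to a JVM-wide stall.
        System.out.println("largest overshoot (ms): " + maxOvershootMs(100L, 20));
    }
}
```

A large overshoot alone does not identify the cause; it only shows that a thread unrelated to the NFS reads was also held up during that interval.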

      The application is running on a lightly loaded system in which
      the JVM is the major process, using around 20-30% of the CPU
      for the node. We have tested on two separate lab hardware nodes
      and experienced the issue on both.

      We would like Sun to analyze the current system patches, validate
      that there are no known issues with our software configuration,
      and tell us whether patches are available for issues that pause
      the JVM unexpectedly.
      The same issue was observed on both the 1.4.2_08 and 1.4.2_09 nodes.

      The two sets of logs and core details have been captured and are on titan.sfbay.sun.com at:
         /tmp/Redknee/100705_214128 and /tmp/Redknee/100805_130529
         System information is at: /tmp/Redknee/sysinfo.txt

      Here are some more details that were collected:
      1) A system is monitoring the application record logs. Is that monitoring system also a Java-based application?
      * Yes, the monitoring tool that we wrote is written in Java.
        It simply polled the Event Record (ER) files and read each line as it was written (on a fixed polling interval).
        When an extended period of time passed in which no ERs were recorded, we would run a UNIX script using the Runtime.getRuntime().exec("scriptfilename") call.
          
      2) When we say it is monitoring, we need to know what activity the monitoring system performs.
            Is it a read-only or a read-write process with respect to the application system?
            Let us know how the communication happens, that is, whether it involves sockets or an FTP process to access the application system.
      * The monitoring tool simply reads the ER files. It does not directly communicate with the application at all.
        It uses read-only permissions, as we do not write to the ER files.
       
      3) We understand that the core file details and logs sent to us were collected on the application system. Correct us if not.
      * Yes, the core files and logs were collected for the actual application system.
      The monitoring system uses java.io and java.nio.channels.FileChannel from the FileInputStream to read the ER files. The monitoring system polls the ERs generated by the application server, which reside on the local filesystem.

      4) Detailed Information :
      The application system polls ERs of a remote system. The ER directory of the remote system is mounted using an NFS mount on the local machine and we poll the ERs from the NFS mount. The polling library that the application system uses to poll these remote ERs is identical to the one used in our monitoring system. It uses java.io.FileInputStream to get the java.nio.channels.FileChannel and we use the FileChannel to read the ERs out of the file. The issue we are experiencing is that when we simulate an NFS timeout by routing the NFS mount's IP to a non-existent IP, the entire JVM is paused rather than just the threads that are accessing the NFS mount. This JVM-wide pause is causing significant issues with our socket connections to external applications and thus needs further investigation. We need to confirm that this NFS mount timeout issue is the root cause of our JVM pauses and to understand why the entire JVM pauses under this scenario, which we hope to learn from your analysis.
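
The read path described above (a FileChannel obtained from a java.io.FileInputStream, polled on a fixed interval) might look roughly like the sketch below. It is not the reporter's polling library: the class name `ErPoller`, the method `poll`, the 8 KB buffer, the one-second interval, and the ISO-8859-1 decoding are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical poller: returns whatever was appended to a file since the previous poll.
class ErPoller {
    private final File file;
    private long position = 0L; // offset of the first byte not yet read

    public ErPoller(File file) {
        this.file = file;
    }

    // Reads the bytes appended since the last call; returns "" when nothing is new.
    public String poll() throws IOException {
        FileInputStream in = new FileInputStream(file);
        try {
            FileChannel channel = in.getChannel();
            long size = channel.size();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ByteBuffer buf = ByteBuffer.allocate(8192);
            while (position < size) {
                buf.clear();
                // Positional read; this call blocks if the underlying mount stops responding.
                int n = channel.read(buf, position);
                if (n <= 0) {
                    break;
                }
                out.write(buf.array(), 0, n);
                position += n;
            }
            return out.toString("ISO-8859-1"); // encoding is an assumption
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ErPoller <er-file>");
            return;
        }
        ErPoller poller = new ErPoller(new File(args[0]));
        for (int i = 0; i < 5; i++) {
            System.out.print(poller.poll());
            Thread.sleep(1000L); // fixed polling interval (assumed value)
        }
    }
}
```

The call to channel.read is where a thread would block if the file lives on an NFS mount that has become unreachable. The sketch does not by itself explain why threads that never touch the mount would pause as well.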

            robm Robert Mckenna
            sdattatrsunw Sreenatha Dattatri (Inactive)