JDK-7116204: DGCAckHandler consumes 80% of the heap


    • Type: Bug
    • Resolution: Duplicate
    • Priority: P3
    • Fix Version/s: None
    • Affects Version/s: 6u29
    • Component/s: core-libs
    • OS: generic
    • CPU: generic

      J2SE Version (please include all output from java -version flag):

      6u29

      Does this problem occur on J2SE 1.4.x, 1.5 or 6? Yes / No (pick one)
      Yes

      Bug Description:

      A customer heap dump clearly shows that over 80% of the heap is being
      consumed by a private static map within sun.rmi.transport.DGCAckHandler.
      This map maps from a java.rmi.server.UID to a sun.rmi.transport.DGCAckHandler
      instance. There are 11,300,771 entries in this map consuming 880,764,432
      bytes of heap memory.

      Currently they have specified -Dsun.rmi.dgc.client.gcInterval=3600000 and
      -Dsun.rmi.dgc.server.gcInterval=3600000 to reduce DGC overhead, and they
      are removing these as a troubleshooting measure. That said, I don't see
      any direct link between these settings and the issue at hand from reading
      the JVM source code. Instead, the problem seems to be in the bowels of
      the JVM's RMI code.

      Reading the decompiled Java 6 Update 29 code, it appears that
      DGCAckHandlers are created and inserted into the map by
      sun.rmi.transport.ConnectionOutputStream.saveObject(),
      which in turn appears to be called by sun.rmi.transport.LiveRef
      whenever writing a live Remote object to an RMI response.

      Once an entry is added to the map, it will only be removed by
      DGCAckHandler.received(UID), which is called upon receipt of a DGCAck
      from the client. A DGC ack timeout (5 minutes by default) will null out
      the non-final fields of DGCAckHandler -- but it won't remove the entry
      from the map! So it would appear that if many disparate live references
      are returned to clients which don't respond with a DGC ack, this map
      will grow without bounds. The JVM should do better than that, though.

      It appears that all the DGCAckHandler instances have a null objList,
      which seems to imply that all of these have timed out!
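
      For what it's worth, here is a minimal, self-contained sketch of the
      pattern described above. It is emphatically not the actual
      sun.rmi.transport source -- the class name and anything not already
      named above (idTable, objList, received(UID), startTimer()) is
      invented -- but it shows why an entry that is added for every outgoing
      live reference leaks whenever the client never sends the ack:

        import java.rmi.server.UID;
        import java.util.ArrayList;
        import java.util.Collections;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.Timer;
        import java.util.TimerTask;

        // Simplified model of the suspected leak (NOT the real DGCAckHandler).
        class AckHandlerSketch {

            // analogous to DGCAckHandler's private static UID -> handler map
            private static final Map<UID, AckHandlerSketch> idTable =
                Collections.synchronizedMap(new HashMap<UID, AckHandlerSketch>());

            private static final Timer timer = new Timer(true);

            private final UID id;
            private List<Object> objList = new ArrayList<Object>();  // pins the marshalled objects

            // one handler is registered for every response that carries a live ref
            AckHandlerSketch(UID id) {
                this.id = id;
                idTable.put(id, this);
                startTimer();
            }

            void startTimer() {
                timer.schedule(new TimerTask() {
                    @Override public void run() {
                        objList = null;  // non-final fields are nulled on timeout...
                        // ...but idTable.remove(id) is never called here, so the
                        // entry itself stays behind if the ack never arrives
                    }
                }, 5 * 60 * 1000);  // 5-minute ack timeout
            }

            // only an actual DGCAck from the client removes the entry
            static void received(UID id) {
                idTable.remove(id);
            }
        }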

      Thought I'd pass this along as I thought it might ring some bells there.
      The anonymous Runnable passed to schedule() in DGCAckHandler.startTimer()
      should also call idTable.remove(); the code feels fragile otherwise.
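
      In terms of the simplified sketch above, the suggested change would look
      something like this (again just an illustration, not the real JDK code):

        // Revised startTimer() for the AckHandlerSketch above: when the ack
        // times out, the entry is evicted from idTable as well, so a client
        // that never acks can no longer grow the map without bound.
        void startTimer() {
            timer.schedule(new TimerTask() {
                @Override public void run() {
                    objList = null;
                    idTable.remove(id);  // the additional call being suggested
                }
            }, 5 * 60 * 1000);
        }
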
      More information from the submitter:

      A customer is encountering enormous memory usage within the bowels of
      DGCAckHandler, as per the details below. The customer in question is
      using Java 6 Update 29 on Solaris 10 and is seeing the issue when doing
      a stress test with OracleATS (Advanced Test Suite).
      More information from the submitter:

      We have finally gotten enough information from the customer to reproduce the issue.

      We have RMI server processes which, when they discover they are overloaded, do the following:
               if ( needToRedirect(...) )
               {
                  MethodServer next_server = nextMethodServer(); // MethodServer is a java.rmi.Remote interface
                  if (next_server != null)
                  {
                     ...
                     throw new ServerLoadException(next_server); // TO_DO: ponder where/how/if we might decide that no one wants this request and return no-next server or some such...
                  }
               }
      where nextMethodServer() internally does:
               nextMethodServer = serverManager.getNextServer( thisMethodServerImpl );
      plus some caching of this result with a time-to-live. Here serverManager is a stub to another RMI server (kind of an overseer), nextMethodServer is a stub to another MethodServer, and thisMethodServerImpl is the MethodServer implementation within this RMI server process.
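
      For illustration only, here is a hypothetical reconstruction of the
      exception described above. The exact field and method shapes are
      assumptions, not the submitter's real code, but they show the detail
      that matters for this bug: the stub sits in an ordinary non-transient
      field, so every marshalled ServerLoadException writes a live remote
      reference into the RMI reply.

        import java.rmi.Remote;

        // Hypothetical stand-in for the submitter's remote interface.
        interface MethodServer extends Remote {
            // remote methods omitted
        }

        // Assumed shape of the exception: the stub is a plain non-transient
        // field and is therefore serialized (as a live remote reference)
        // every time the exception is marshalled back to a caller.
        class ServerLoadException extends Exception {
            private final MethodServer nextServer;

            ServerLoadException(MethodServer nextServer) {
                this.nextServer = nextServer;
            }

            MethodServer getNextServer() {
                return nextServer;
            }
        }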

      It appears that either the generation of the ServerLoadException or the invocation of getNextServer() is triggering DGCAckHandler instances to be created which will never be DGC ack'ed. Given the time-to-live caching of the getNextServer() result, the ServerLoadException is the issue.

      The number of leaked DGCAckHandlers appears to be exactly equal to the number of calls to ServerLoadException(), so I think that points quite convincingly to this being the cause of the issue.

      Modifying ServerLoadException to hide the MethodServer stub in a byte[] (as per attached) rather than leaving it as a simple non-transient field seems to completely work around the issue.
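
      As a guess at what the attached change might look like (this is an
      assumed shape reusing the hypothetical MethodServer interface from the
      sketch above, not the submitter's actual patch):

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.io.ObjectInputStream;
        import java.io.ObjectOutputStream;

        // Workaround sketch: the stub is serialized into an opaque byte[], so
        // marshalling the exception over RMI no longer writes a live Remote
        // reference and no DGCAckHandler entry is created for it.
        class ServerLoadException extends Exception {
            private final byte[] nextServerBytes;

            ServerLoadException(MethodServer nextServer) throws IOException {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                ObjectOutputStream oos = new ObjectOutputStream(bos);
                oos.writeObject(nextServer);  // stub hidden inside a plain byte[]
                oos.close();
                this.nextServerBytes = bos.toByteArray();
            }

            MethodServer getNextServer() throws IOException, ClassNotFoundException {
                ObjectInputStream ois =
                    new ObjectInputStream(new ByteArrayInputStream(nextServerBytes));
                try {
                    return (MethodServer) ois.readObject();  // reconstituted on demand
                } finally {
                    ois.close();
                }
            }
        }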

      That raises the question as to what should be fixed in the RMI internals, as I don't see anything clearly wrong with the RMI usage pattern here that should trigger a DGCAckHandler leak.

            Assignee: dmocek Darryl Mocek
            Reporter: tyao Ting-Yun Ingrid Yao (Inactive)
            Votes: 0
            Watchers: 3
