JDK-8328351

Post-mortem crash analysis with jcmd


    • JEP
    • Resolution: Unresolved
    • P4
    • None
    • core-svc
    • None
    • Kevin Walls
    • Feature
    • Open
    • JDK
    • serviceability dash dev at openjdk dot org
    • M
    • M

      Summary

      The jcmd tool supports the monitoring and troubleshooting of a running HotSpot JVM. Extend jcmd so that it can also be used to diagnose a JVM that has crashed. This will establish a consistent troubleshooting experience in both live and post-mortem environments.

      Goals

      • Make the troubleshooting of crashed JVMs as familiar and productive as troubleshooting live JVMs.

      • Enable post-mortem diagnostics on Linux and Windows.

      • Reduce the future cost of JDK maintenance by focusing serviceability work on jcmd rather than other tools and components such as jhsdb and the underlying Serviceability Agent.

      Non-Goals

      • It is not a goal to support all jcmd diagnostics in post-mortem environments.

      • It is not a goal to run and debug Java code in post-mortem environments.

      • It is not a goal to enable post-mortem diagnostics on all supported operating systems.

      • It is not a goal to remove legacy serviceability tools and components, such as jhsdb and the Serviceability Agent, at this time.

      Motivation

      Serviceability is the ability to monitor, observe, debug, and troubleshoot an application. Monitoring and observability tools allow you to connect to a live JVM and examine the application. This includes the application’s code, such as loaded classes and just-in-time compiled methods, as well as its state, such as the Java heap and the stacks of Java threads and native threads. JDK tools such as jstack and jmap produce thread dumps and heap dumps, respectively, from a live JVM, while tools such as JDK Mission Control enable you to browse memory usage and threads visually. If a tool connects to the JVM via the JMX protocol then you can also troubleshoot the application by, e.g., activating more verbose logging in specific subsystems.

      In extreme scenarios, the JVM may terminate unexpectedly in a way that cannot be monitored by such tools. This can occur because of buggy native code in the application or libraries, or due to bugs in the JVM itself. At termination, the HotSpot JVM emits a crash report file (hs_err_pidXXX.log) that contains information about the fault and the state of the application, such as the stack trace of the failing thread and a list of loaded libraries. The operating system also saves the memory of the JVM process to a file known as a core dump. You can use crash reports and core dumps post-mortem to gain a deeper understanding of what went wrong and identify steps toward resolution.

      Unfortunately, the tools available for the post-mortem analysis of JVM core dumps are problematic:

      • Using native debuggers such as gdb is frustrating because they cannot interpret JVM data structures in core dumps to display a Java-level view of application state. For example, if you determine that a Java object starts at a particular address in memory, then finding something as basic as the class of the object requires manually decoding words in the object's header. Debugger scripts can help automate the decoding of JVM data structures in core dumps, but the work remains error-prone and the scripts require ongoing maintenance since the layout of object headers changes over time.

      • The jhsdb tool, introduced in JDK 9, can open a core dump and interpret JVM data structures. It uses a HotSpot-internal mechanism known as the Serviceability Agent (SA). Other launchers that invoke the SA have been available in previous releases. Unfortunately, the SA codebase is brittle and dated; it requires continuous maintenance as the JVM evolves, and major work to expose new JVM features. (Despite its name, the Serviceability Agent is not a Java agent, i.e., a component that can alter the code of a running application.)
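      To make the first point concrete, here is a minimal sketch of the kind of arithmetic a debugger script has to perform by hand: turning a compressed class pointer read from an object header into a full address. The class name, the base and shift values, and the sample pointer are all hypothetical. The real values depend on the JVM build and its flags, and the position of the class pointer within the header is exactly the kind of detail that changes between releases.

```java
// Hypothetical illustration only: the base, shift, and sample value below are
// made up, and real HotSpot header layouts differ between releases.
final class NarrowKlassDecoder {
    private final long base;   // assumed encoding base address
    private final int shift;   // assumed encoding shift

    NarrowKlassDecoder(long base, int shift) {
        this.base = base;
        this.shift = shift;
    }

    // Widen the 32-bit value as unsigned, scale it, and add the base.
    long decode(int narrowKlass) {
        return base + (Integer.toUnsignedLong(narrowKlass) << shift);
    }

    public static void main(String[] args) {
        NarrowKlassDecoder decoder = new NarrowKlassDecoder(0x0000_0008_0000_0000L, 3);
        long address = decoder.decode(0x0010_2030);
        System.out.printf("class metadata address: 0x%016x%n", address);
    }
}
```

      Even this small step assumes the script already knows where the class pointer sits in the header and which base and shift the crashed JVM used, which is why such scripts need ongoing maintenance.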

      jcmd, introduced in JDK 7, is a lightweight tool for JVM diagnostics. It can connect to a live JVM via the Attach API and present Java-level information about an application. It offers over 50 commands for listing Java threads, detailing memory use, examining the state of the garbage collector, and so forth. However, jcmd can attach only to live processes. Given its flexibility and popularity, it would be useful if jcmd could also be used for the post-mortem analysis of core dumps. This would give a consistent experience in both live and post-mortem troubleshooting.

      Description

      We extend jcmd to support post-mortem analysis by using the data in a core dump to recreate the JVM’s memory image at the time of the crash, and by executing code in the JVM binary to interpret the data structures in that image. This revival technique enables jcmd’s diagnostic commands to work exactly as they do in a live JVM, with no changes to the commands or their implementations.

      For example, if a JVM crash results in the core dump core.1234, then running:

      $ jcmd core.1234 Thread.print

      will produce the same kind of output as when jcmd is connected to a live JVM:

      Opening dump file 'core.1234'...
      2025-04-01 14:17:18
      Full thread dump Java HotSpot(TM) 64-Bit Server VM (25-internal-LTS-2025-03-30-1738352.name... mixed mode, sharing):
      ...
      "Thread-0" #34 [1183517] prio=5 os_prio=0 cpu=0.99ms elapsed=0.07s tid=0x00007ff8fc208cc0 ...
         java.lang.Thread.State: RUNNABLE
      Thread: 0x00007ff8fc208cc0  [0x120f1d] State: _at_safepoint _at_poll_safepoint 0
         JavaThread state: _thread_blocked
              at ThreadsMem$1.run(ThreadsMem.java:25)
              - locked <0x00000000fe300c98> (a java.lang.Object)
              at java.lang.Thread.runWith(java.base@25-internal/Thread.java:1460)
              at java.lang.Thread.run(java.base@25-internal/Thread.java:1447)
              ...
      ...

      jcmd in post-mortem environments

      jcmd currently supports 57 commands in a live JVM, of which 26 are relevant and available in the post-mortem environment:

      Compiler.CodeHeap_Analytics    Compiler.codecache    Compiler.codelist    Compiler.directives_print
      Compiler.memory                Compiler.perfmap      Compiler.queue
      
      GC.class_histogram             GC.heap_dump          GC.heap_info
      
      JVMTI.data_dump
      Thread.print
      
      VM.class_hierarchy             VM.classes            VM.classloader_stats    VM.classloaders
      VM.command_line                VM.events             VM.flags                VM.metaspace
      VM.native_memory               VM.stringtable        VM.symboltable          VM.systemdictionary
      VM.version
      help
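
      The list above can be captured as a simple lookup, for instance in a wrapper or test that wants to check a command name before running it against a core dump. This is only a sketch: the class and method names are hypothetical, and the set is copied from the list above rather than queried from jcmd.

```java
import java.util.Set;

// Hypothetical helper: the command names are copied from the list above.
final class PostMortemCommands {
    static final Set<String> AVAILABLE = Set.of(
        "Compiler.CodeHeap_Analytics", "Compiler.codecache", "Compiler.codelist",
        "Compiler.directives_print", "Compiler.memory", "Compiler.perfmap",
        "Compiler.queue",
        "GC.class_histogram", "GC.heap_dump", "GC.heap_info",
        "JVMTI.data_dump", "Thread.print",
        "VM.class_hierarchy", "VM.classes", "VM.classloader_stats", "VM.classloaders",
        "VM.command_line", "VM.events", "VM.flags", "VM.metaspace",
        "VM.native_memory", "VM.stringtable", "VM.symboltable", "VM.systemdictionary",
        "VM.version", "help");

    // True if the named command appears in the post-mortem list.
    static boolean isAvailable(String command) {
        return AVAILABLE.contains(command);
    }

    public static void main(String[] args) {
        System.out.println("post-mortem commands: " + AVAILABLE.size());
        System.out.println("Thread.print available: " + isAvailable("Thread.print"));
        System.out.println("JFR.start available: " + isAvailable("JFR.start"));
    }
}
```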

      The post-mortem environment must have the same operating system and CPU architecture as the environment in which the JVM crashed.

      It is often difficult to access production servers where the JVM has crashed, so it is common to transport core dumps to developer workstations for analysis. Such workstations typically run newer JDKs than production servers, so to facilitate analysis, it is not necessary for the jcmd tool to come from the same JDK as the JVM that crashed. The jcmd tool in one JDK release can revive core dumps from another JDK release as long as the JVM binary from the other release is available. The other release may be older or newer than the release where jcmd is running, as long as both releases are at least JDK NN. When running jcmd, the path to the JVM binary is specified via the new -L option:

      $ jcmd -L /transported_files/libjvm.so core.1234 Thread.print

      In JDK NN, jcmd can take either the name of a Java class or the filename of a core dump as an argument. Since the filename of a core dump might resemble a class name, the new -c option indicates that the argument is, in fact, a core dump:

      $ jcmd -c MyApp GC.heap_dump
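
      The ambiguity is easy to demonstrate. The sketch below uses the standard javax.lang.model.SourceVersion.isName method to test whether a string is a syntactically valid qualified Java name. It does not reproduce how jcmd itself parses its arguments, and the class name TargetAmbiguity is made up for illustration.

```java
import javax.lang.model.SourceVersion;

// Illustration only: shows which strings are syntactically valid Java names.
// This is not jcmd's actual argument-parsing logic.
final class TargetAmbiguity {
    // True if the argument could be read as a (possibly qualified) class name.
    static boolean couldBeClassName(String arg) {
        return SourceVersion.isName(arg);
    }

    public static void main(String[] args) {
        String[] samples = {"MyApp", "com.example.MyApp", "core", "core.1234"};
        for (String sample : samples) {
            System.out.println(sample + " could be a class name: " + couldBeClassName(sample));
        }
    }
}
```

      A core dump named core.1234 cannot be mistaken for a class name because 1234 is not a valid identifier, but a dump saved as MyApp or core can, which is the case the -c option addresses.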

      Reviving a core dump

      To revive a JVM instance from a core dump, jcmd creates a subprocess so that the revived instance has its own address space, distinct from the address space of the JVM running jcmd. It populates that address space by memory-mapping the core dump to recreate the internal data structures of the JVM, the Java heap, and the stacks of native threads, all at their original memory addresses so that pointers remain valid. jcmd also loads the JVM binary (libjvm.so) at its original memory address.
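
      The bookkeeping behind this step can be sketched as a table of segments, each recording the original virtual address of a region, its length, and where its bytes live in the core file. The sketch below only looks up which segment covers a given address; it does not map any memory, and the class names and sample values are hypothetical. A real implementation would take these values from the core file's own headers and pass them to the operating system's memory-mapping facility.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical sketch of a core-dump segment table; it performs no mapping.
final class CoreSegments {
    static final class Segment {
        final long vaddr;       // original virtual address of the region
        final long length;      // size of the region in bytes
        final long fileOffset;  // where the region's bytes start in the core file

        Segment(long vaddr, long length, long fileOffset) {
            this.vaddr = vaddr;
            this.length = length;
            this.fileOffset = fileOffset;
        }

        boolean contains(long addr) {
            // Addresses are compared as unsigned 64-bit values.
            return Long.compareUnsigned(addr, vaddr) >= 0
                && Long.compareUnsigned(addr - vaddr, length) < 0;
        }
    }

    private static final Comparator<Long> UNSIGNED = Long::compareUnsigned;
    private final TreeMap<Long, Segment> byStart = new TreeMap<>(UNSIGNED);

    void add(Segment segment) {
        byStart.put(segment.vaddr, segment);
    }

    // Find the core-file offset holding the byte that lived at addr, if any.
    OptionalLong fileOffsetOf(long addr) {
        Map.Entry<Long, Segment> entry = byStart.floorEntry(addr);
        if (entry == null || !entry.getValue().contains(addr)) {
            return OptionalLong.empty();
        }
        Segment s = entry.getValue();
        return OptionalLong.of(s.fileOffset + (addr - s.vaddr));
    }

    public static void main(String[] args) {
        CoreSegments table = new CoreSegments();
        table.add(new Segment(0x00007ff8fc000000L, 0x400000L, 0x2000L));
        OptionalLong offset = table.fileOffsetOf(0x00007ff8fc208cc0L);
        System.out.println(offset.isPresent()
            ? String.format("file offset 0x%x", offset.getAsLong())
            : "address not present in the dump");
    }
}
```

      Because revival places each region back at its original address, pointers stored inside the JVM's data structures can be followed directly, without this kind of per-access lookup.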

      The revived JVM instance is not live in the same way it was at run time. No Java code can be executed and no garbage collection occurs. The instance is, however, sufficiently complete that jcmd can interpret data structures in the revived instance by calling native JVM functions in the revived instance — the exact same functions it invokes on a live instance when using the Attach API. This approach makes the jcmd diagnostic code independent of whether the observed JVM is alive or dead; either way, the diagnostics call the same native functions. Thus no new code is needed to, e.g., inspect an object, extract a heap dump, or obtain a thread's stack frames.

      Aside from the JVM binary, it is not necessary to load any native library from the crashed process. This enables troubleshooting a core dump after transporting it to a different machine, where the same native libraries might not be available. For post-mortem analysis, jcmd needs only the core dump and the JVM binary that was in use at the time of the crash.

      Future Work

      • We plan eventually to support post-mortem troubleshooting with jcmd on macOS, in addition to Linux and Windows.

      • We expect to make further enhancements to jcmd to aid troubleshooting in both live and post-mortem environments. Two examples are new commands for inspecting arbitrary Java objects and for extracting Java class definitions ("class dumping"). We also expect to enhance some existing commands, e.g., VM.uptime, to work in both environments.

      • Some existing diagnostic commands are implemented in Java rather than in native code. This means they are not compatible with process revival and cannot be used in post-mortem environments. Of these Java commands, some are of value only in a live JVM, e.g., ManagementAgent.start and JFR.start. We may investigate other commands written in Java to see if they could work in post-mortem environments. For example, Thread.dump_to_file, which outputs a list of virtual threads in JSON format, could be useful post-mortem.

      • In the future, developers of new commands will need to consider the possibility of post-mortem execution when choosing the implementation language.

      Alternatives

      • Invest in improvements to the Serviceability Agent (SA).

        SA is written in Java. It relies on a native library (libsaproc) to return the contents of raw memory from either a running process or a core dump. This makes the SA code independent of whether the observed JVM is alive or dead, but it also means that SA must turn byte arrays returned by libsaproc into instances of Java classes that model threads, stack frames, locks, and so forth. This interpretation is tedious and intricate; it requires the maintainers of SA to, e.g., know how each garbage collector lays out Java objects in the heap. It requires the SA code to be updated continuously as the JVM evolves. Finally, it duplicates the functionality of vast swathes of native code in the JVM which manage run-time data structures.

        By contrast, the new jcmd technique of reviving the memory and code of a crashed JVM instance reuses the native code that managed the JVM's data structures when the JVM was alive. Duplicating the memory of a formerly-live process is much more efficient than duplicating the code required to understand it.

        SA embraced high implementation complexity in order to support a rich Java API, but the rapid pace of JVM development has made that complexity costly to maintain, and so the functionality of the API has suffered. Furthermore, the rich functionality of SA's Java API is more than is necessary for JVM troubleshooting. Instead of SA's high-cost/rich-feature approach, jcmd with process revival takes a low-cost/core-feature approach. Since it is low cost, it can be supported over the long term.

      • Continue to rely on native debuggers such as gdb.

        Native debuggers can provide low-level JVM diagnostics, and will remain an essential part of JDK troubleshooting. However, the technical effort needed to display Java objects in human-readable form makes them a poor alternative to an enhanced jcmd.

      Risks and Assumptions

      • The set of jcmd commands usable with revived JVM instances is fixed when the JVM is built. We assume this is acceptable because the existing jcmd commands are widely applicable and well known, and the additional commands planned under Future Work will extend them. Further commands can be created for specific requirements in the future and incorporated into updated versions of the JVM via the JDK build process.

      • A risk of allowing jcmd to be used post-mortem is that core dumps containing sensitive information may be transferred to insecure environments for analysis. However, this is no different than the situation today with jhsdb and SA, so there is no new security risk.

      • We assume that the JVM binary in use at the time of the crash is available in the post-mortem environment. This is reasonable as access to the correct binary is required by existing crash analysis methods.

            Alex Buckley