JDK-8328351

Enable post-mortem crash analysis with jcmd

    • JEP
    • Resolution: Unresolved
    • P4
    • None
    • core-svc
    • None
    • Kevin Walls
    • Feature
    • Open
    • JDK
    • serviceability dash dev at openjdk dot org
    • M
    • M

      Summary

      Extend the jcmd tool to provide diagnostics on a Java Virtual Machine that has terminated unexpectedly. Achieve this by a novel technique of process revival which provides a foundation for post-mortem analysis. Users will enjoy a consistent troubleshooting experience across live environments and post-mortem environments.

      Goals

      • Make the troubleshooting of a crashed JVM as interactive as troubleshooting a live JVM.
      • Enable post-mortem diagnostics on most operating systems supported by OpenJDK.
      • Reduce the cost of JDK maintenance by focusing all implementation effort for serviceability on jcmd.

      Non-Goals

      • It is not a goal to support all jcmd diagnostics in post-mortem environments.
      • It is not a goal to run Java code during post-mortem troubleshooting.
      • It is not a goal to remove legacy serviceability tools, such as the Serviceability Agent, from the JDK at this time.

      Motivation

      Serviceability is the ability of a system operator to monitor, observe, debug, and troubleshoot an application. Monitoring and observability tools allow the operator to connect to a live JVM and examine the state of the application. This includes the code of the application, such as the loaded classes and the methods which have been compiled just-in-time, as well as the progress of execution, such as the stacks of Java threads and native threads. JDK tools such as jstack and jmap produce thread dumps and heap dumps from a live JVM, while tools such as Java Mission Control let users browse threads and memory visually. Depending on how a tool connects to the JVM (for example, over the JMX protocol), the operator may also be able to troubleshoot the application, for instance by activating more verbose logging by the garbage collector.

      In extreme scenarios, the JVM terminates unexpectedly in a way that cannot be monitored by such tools. This can occur because of buggy native code in the application or libraries, or due to bugs in the JVM itself. At termination, the JVM emits a crash report (hs_err_pidXXX.log) that contains information about the fault and the state of the application, such as the stack trace of the failing thread, a snapshot of the heap, a list of loaded libraries, etc. The operating system also saves the memory of the JVM process to a file known as a core dump. System operators use crash reports and core dumps post-mortem to gain a deeper understanding of what went wrong and to identify steps toward resolution.

      Unfortunately, the tools available for post-mortem analysis of a core dump are problematic:

      • Using a native debugger such as gdb is frustrating because it has no knowledge about the representation of JVM artifacts such as threads and objects in the memory of the crashed process. For example, if the operator identifies a Java object starting at a particular address in memory, then finding something even as basic as the class of the object means manually decoding words in the object's header, which as an extra complication can vary between JDK releases. Debugger scripts can help to automate the decoding of JVM artifacts in the core dump, but the work remains error-prone and the scripts need ongoing maintenance.

      • JDK 6 introduced the Serviceability Agent (SA). This is not a Java agent but rather a tool that can open a core dump and decode JVM artifacts automatically. SA also exposes these artifacts through an unsupported API. SA requires continual maintenance as the JVM evolves, and requires major work to support new JVM features. This causes friction for operators because the depth of information available from SA depends on which JVM features were in use.

      jcmd was introduced in JDK 7 as a lightweight tool for JVM diagnostics. It connects to a live JVM via the Attach API and can present Java-level information about the state of the application. It offers over 50 commands for displaying Java threads and objects, showing details of memory use, the state of the garbage collector, etc. However, jcmd is limited to attaching to live processes. Given its flexibility and popularity, it is appealing to enable the use of jcmd for post-mortem analysis on a core dump. This would give operators a symmetrical experience for live and post-mortem troubleshooting.

      Description

      We extend jcmd so it can produce diagnostics from the core dump of a JVM process. This will simplify the troubleshooting process for system operators, and unify the serviceability experience across live and post-mortem environments.

      Post-mortem analysis with jcmd uses a "revival" technique for diagnosing a crashed process. By using data from the core dump to recreate the process's memory image at the time of the crash, and by executing code in the JVM binary, it is possible to make jcmd's diagnostic commands work as they do in a live JVM, with no changes to the commands or their implementations.

      For example, if a JVM crash resulted in the core dump core.1234, then running:

      $ jcmd core.1234 Thread.print

      will produce the same kind of output as when jcmd is connected to a live JVM:

      Opening dump file 'core.1234'...
      2025-04-01 14:17:18
      Full thread dump Java HotSpot(TM) 64-Bit Server VM (25-internal-LTS-2025-03-30-1738352.name... mixed mode, sharing):
      ...
      "Thread-0" #34 [1183517] prio=5 os_prio=0 cpu=0.99ms elapsed=0.07s tid=0x00007ff8fc208cc0 ...
         java.lang.Thread.State: RUNNABLE
      Thread: 0x00007ff8fc208cc0  [0x120f1d] State: _at_safepoint _at_poll_safepoint 0
         JavaThread state: _thread_blocked
              at ThreadsMem$1.run(ThreadsMem.java:25)
              - locked <0x00000000fe300c98> (a java.lang.Object)
              at java.lang.Thread.runWith(java.base@25-internal/Thread.java:1460)
              at java.lang.Thread.run(java.base@25-internal/Thread.java:1447)
              ...

      jcmd in post-mortem environments

      jcmd supports 56 commands in a live JVM. 28 of them are available in the post-mortem environment:

      Compiler.CodeHeap_Analytics   Compiler.codecache   Compiler.codelist
      Compiler.directives_print     Compiler.memory      Compiler.perfmap
      Compiler.queue
      
      GC.heap_dump   GC.heap_info
      
      JVMTI.data_dump
      
      System.dump_map   System.map   System.native_heap_info
      
      Thread.print
      
      VM.class_hierarchy    VM.classes         VM.classloader_stats   VM.classloaders
      VM.command_line       VM.dynlibs         VM.events              VM.flags
      VM.metaspace          VM.native_memory   VM.stringtable         VM.symboltable
      VM.systemdictionary   VM.version

      The post-mortem environment must have the same operating system and CPU architecture as the environment where the JVM crashed.
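
      The CPU architecture half of this requirement can be checked before a dump is handed to jcmd. The following is a minimal sketch, not part of this proposal, and it assumes a Linux ELF core file (other platforms use different dump formats). The function names describe_elf_core and matches_host are hypothetical, and only two architectures are mapped for illustration. It reads the ELF header fields that record the file type and the CPU architecture; it does not check the operating system.

```python
import platform
import struct

# e_machine values from the ELF specification (illustrative subset).
ELF_MACHINES = {0x3E: "x86_64", 0xB7: "aarch64"}
ET_CORE = 4  # e_type value that marks a core file


def describe_elf_core(header: bytes) -> dict:
    """Decode the ELF header fields that identify a core dump."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data = header[4], header[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unrecognised ELF class or byte order")
    endian = "<" if ei_data == 1 else ">"
    # e_type and e_machine are two 16-bit fields right after e_ident.
    e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "is_core": e_type == ET_CORE,
        "machine": ELF_MACHINES.get(e_machine, hex(e_machine)),
    }


def matches_host(info: dict) -> bool:
    """True if the dump is a core file for this machine's CPU architecture."""
    host = platform.machine().lower()
    host = {"amd64": "x86_64", "arm64": "aarch64"}.get(host, host)
    return bool(info["is_core"] and info["machine"] == host)
```

      In use, the first 20 bytes of the file are sufficient: open the dump in binary mode, pass f.read(20) to describe_elf_core, and pass the result to matches_host.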

      It is often difficult to access production servers where the JVM has crashed, so it is common to transport core dumps to developer workstations for analysis. Developer workstations typically run newer JDKs than production servers, so to facilitate analysis, it is not necessary for jcmd to come from the same JDK as the JVM that crashed. jcmd in one JDK release can revive core dumps from another JDK release as long as the JVM binary from the other release is available. For example, if the core dump is X and the server's JVM binary is Y, then invoking:

      $ jcmd -L /path/to/Y X Thread.print

      will produce:

      ...

      jcmd can take as an argument either the name of a Java class or the filename of a core dump. Since the filename of a core dump might resemble a class name, a new -c option indicates that the argument is a core dump rather than a class. For example:

      $ jcmd -c main.program GC.heap_dump
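
      When post-mortem analysis is scripted, the invocation forms shown above can be assembled programmatically. The following is a hedged sketch only: build_jcmd_argv and POST_MORTEM_COMMANDS are hypothetical names, the command set is copied from the list in this section, and passing -L and -c together in one invocation is an assumption, since the text shows them only separately.

```python
# The 28 commands listed above as available in the post-mortem environment.
POST_MORTEM_COMMANDS = frozenset("""
    Compiler.CodeHeap_Analytics Compiler.codecache Compiler.codelist
    Compiler.directives_print Compiler.memory Compiler.perfmap Compiler.queue
    GC.heap_dump GC.heap_info
    JVMTI.data_dump
    System.dump_map System.map System.native_heap_info
    Thread.print
    VM.class_hierarchy VM.classes VM.classloader_stats VM.classloaders
    VM.command_line VM.dynlibs VM.events VM.flags
    VM.metaspace VM.native_memory VM.stringtable VM.symboltable
    VM.systemdictionary VM.version
""".split())


def build_jcmd_argv(core, command, jvm_binary=None, force_core=False):
    """Return the argument vector for running one command on a core dump.

    jvm_binary: path to the JVM binary that crashed (passed with -L).
    force_core: add -c so the argument is treated as a core dump even
                if its name resembles a class name.
    """
    if command not in POST_MORTEM_COMMANDS:
        raise ValueError(f"{command} is not listed as a post-mortem command")
    argv = ["jcmd"]
    if jvm_binary is not None:
        argv += ["-L", jvm_binary]
    if force_core:
        # Assumption: -c may be combined with -L; the text does not say.
        argv.append("-c")
    argv += [core, command]
    return argv
```

      The resulting list can be passed to subprocess.run. Rejecting unlisted commands early, such as Thread.dump_to_file, avoids starting a revival that cannot produce output.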

      Reviving a core dump

      jcmd invokes a native helper program. The memory of the crashed process is "revived" into this helper subprocess, and the diagnostic command is executed there. The helper subprocess is needed to give the revived process its own address space, avoiding conflicts with the address space of the JVM running jcmd.

      The helper subprocess populates its address space from the data in the core dump. It also loads the JVM binary at the same virtual address as in the crashed process. The ability to load the JVM binary at a virtual address matching the core dump is achieved by relocating a copy of the binary to that preferred address. In turn, the relocation is achieved by copying and patching the JVM binary file.

      Platform-dependent analysis is required to locate the memory mappings to revive. The restored memory mappings include data local to the JVM, and global data such as the Java heap. The memory representing native thread stacks is restored, so references into them will resolve. There is no reconstruction of the threads as the native OS libraries knew them, as these threads are not going to execute.
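
      As an illustration of what locating the memory mappings involves on one platform, the sketch below lists the PT_LOAD program headers of a Linux ELF64 core file. Each such header records a virtual address range of the crashed process and where its bytes are stored in the file. This is an assumption-laden example rather than the helper's implementation: it handles only little-endian ELF64, the names Segment and load_segments are hypothetical, and it ignores the extended numbering scheme that very large dumps use when they have more than 65,534 program headers.

```python
import struct
from typing import NamedTuple

PT_LOAD = 1  # program header type for a loadable memory segment


class Segment(NamedTuple):
    vaddr: int        # virtual address in the crashed process
    file_offset: int  # where the bytes are stored in the core file
    file_size: int    # bytes present in the file
    mem_size: int     # bytes occupied in memory (may exceed file_size)
    flags: int        # permission bits: 4 = read, 2 = write, 1 = execute


def load_segments(image: bytes) -> list:
    """Return the PT_LOAD segments described by an ELF64 image."""
    if image[:4] != b"\x7fELF" or image[4] != 2 or image[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    # ELF64 header: e_phoff at byte 32, e_phentsize at 54, e_phnum at 56.
    (e_phoff,) = struct.unpack_from("<Q", image, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 54)
    segments = []
    for i in range(e_phnum):
        # Elf64_Phdr: type, flags, offset, vaddr, paddr, filesz, memsz, align.
        (p_type, p_flags, p_offset, p_vaddr,
         _paddr, p_filesz, p_memsz, _align) = struct.unpack_from(
            "<IIQQQQQQ", image, e_phoff + i * e_phentsize)
        if p_type == PT_LOAD:
            segments.append(
                Segment(p_vaddr, p_offset, p_filesz, p_memsz, p_flags))
    return segments
```

      A revival helper would map each returned range at its original virtual address, which is why absolute pointers stored in the dump continue to resolve.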

      The JVM is not "live" as it was at run time. No Java code is executable, and no garbage collection occurs. However, the JVM binary is loaded at the correct address so its code can be executed. Absolute pointers are satisfied by being memory mapped in from the core dump, as are memory references relative to the running code. jcmd calls a JVM helper method to reset any state the JVM had concerning the addresses of other native libraries at the time of the crash.

      This revival technique does not require loading every native library from the crashed process. This is to enable running diagnostics when the core dump is transported to a different machine, where the same libraries are not available. These transported core dumps are traditionally tricky to set up in a debugger, often requiring native libraries to match the original machine. jcmd needs only the JVM binary which crashed, and the core dump.

      Alternatives

      • Invest more effort into maintenance of the Serviceability Agent. SA will work with all JVM features if enough time is spent on it. However, SA and the other alternatives proposed over the years have always involved duplication of effort somewhere.

      • Native debug information goes some way to providing low-level JVM diagnostics, and will remain an essential part of debugging. However, the effort and scripting needed to extract Java objects in human-readable form means it is a poor alternative to an enhanced jcmd.

      Risks and Assumptions

      • A risk of allowing jcmd to be used post-mortem is that core dumps containing sensitive information may be transferred to an insecure environment for analysis. However, this is no different from existing troubleshooting efforts, so there is no new security risk.

      • A risk of the new JVM helper method (to reset native library state) is that it could be called during normal operation of a live JVM. This is extremely unlikely.

      • We assume that the post-mortem environment has a copy of the exact JVM binary in use at the time of the crash. This assumption is reasonable because it is expected in troubleshooting.

      • The set of diagnostic commands usable with the revived process is configured when jcmd is built. We assume this is acceptable because the core commands are widely applicable and well known. Additional commands that act on the revived process can be created for specific requirements and incorporated into jcmd via the JDK build process.

      Future Work

      Some jcmd commands are implemented in Java, such as Thread.dump_to_file which outputs a list of virtual threads in JSON format. Commands implemented in Java are not usable with the revival technique. However, these commands tend to be of greatest value in the live environment, not the post-mortem environment. We may investigate supporting these commands in the post-mortem environment.

      Some features of the Serviceability Agent are unavailable in jcmd. These omissions may be rectified with new jcmd commands in separate enhancements. Two examples are inspecting arbitrary Java objects (see JDK-8318026) and extracting a Java class declaration ("class dumping"). Other SA features, including its GUI and its Java API, are considered dated and of niche interest; they will have no equivalent in jcmd.

      We may also extend post-mortem support to more operating systems.
