Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8337789

CPU Time based profiling for JFR

XMLWordPrintable

    • Jaroslav Bachorík, Johannes Bechberger & Ron Pressler
    • Feature
    • Open
    • jfr
    • Implementation
    • hotspot dash jfr dash dev at openjdk dot org
    • M
    • S

      Summary

      Offer a new CPU-time profiling capability for JFR on Linux to profile CPU usage. This capability is currently considered experimental and may be made permanent in a future release.

      Goals

      • Add an CPU-time-based sampling represented by a new JFR event on Linux systems.
      • Record events even when the JVM fails to walk the stack.

      Non-Goals

      • Do not include non-Java frames in the event stack traces.
      • Do not extend the implementation of this sampler to platforms other than Linux.

      Motivation

      Profiling a program means determining how its various component units consume different kinds of computational resources such as memory, computation, or elapsed time. A program profile shows what portion of a relevant resource is consumed by the different parts of the program, e.g. method foo consumes 20% of the resource, while method bar consumes only 0.1%.

      Profiling is required to make a program efficient as it guides the developer on which units to optimize. Without one, the programmer may waste time optimizing a method with little impact if the method consumes a negligible amount of resources in the program to begin with. For example, optimizing a method that takes up 0.1% of the total program execution time to run ten times faster will only shorten the program's execution time by 0.09%. Therefore, an efficient program is one that's been properly profiled and optimized accordingly.

      JFR, the JDK Flight Recorder, is the JDK's core mechanism for profiling (and monitoring) capabilities. JFR can profile a program with respect to two resources: allocation of heap memory and elapsed ("wall-clock") time. A heap memory profile shows which parts of the code allocate objects that consume a significant portion of the heap and what type of objects these are; it helps to make the program's memory use more efficient. The execution time profiler shows where different program threads spend their time.

      To capture a profile, JFR records events that statistically sample the program's activity as it consumes the relevant resource. The collection of different JFR events can be turned on or off, allowing a more chatty, higher-overhead collection of information during the development phases and a less chatty, lower-overhead collection of information in production.

      To obtain an execution profile we sample the execution stack of program threads every fixed interval, say, 20ms. The respective profile is then created by analyzing the recording of the profiling events.

      Execution time vs. CPU time

      An execution time profile determines the duration of methods in the program, identifying methods that take a long time to complete. Optimizing them will improve the program's latency.

      A method that sorts an array spends all of its time on the CPU. Its duration corresponds to how much CPU it consumes. In contrast, a method that reads from a socket may spend most of its time idly waiting for bytes to arrive over the wire. Of the time it consumed, only a small portion would be spent on the CPU. An execution profile will not distinguish between these cases.

      Even a program that does a lot of IO can be constrained by the CPU. Some computation-heavy methods may not consume much time overall compared to an IO operation — having little effect on the latency of processing a transaction — yet still amount to most of the CPU consumption, thus affecting the program's throughput. Identifying them and optimizing them will reduce CPU consumption and improve the program's throughput.

      For example, consider a program with two threads, each performing HTTP requests: one runs the tenFastRequests method that makes ten requests, sequentially, to an HTTP endpoint that responds in 10ms; the other runs the oneSlowRequest method that makes a single request to an endpoint that responds in 100ms.

      An execution profile of that program is obtained like so:

      java -XX:StartFlightRecording=filename=profile.jfr,settings=profile.jfc HttpRequestsExample client

      JFR obtains this profile by recording ExecutionSample events on a subset of the threads at a regular time interval. The event captures the stack trace of a thread at the moment the sample is taken, which

      We can view the profile using a JFR view as follows (the event shown is native-methods because most of the time is spent in native methods):

      > jfr view native-methods profile.jfr
      
                            Waiting or Executing Native Methods
      
      Method                                                          Samples Percent
      <hr />
      
      sun.nio.ch.SocketDispatcher.read0(FileDescriptor, long, int)        102  98.08%
      jdk.jfr.internal.JVM.emitEvent(long, long, long)                      2   1.92%

      If we use JMC to display the results as a flame graph, we see that oneSlowRequest and tenFastRequests take up a similar amount of execution time, as we'd expect given their similar latency:

      On the other hand, we expect tenFastRequests to take up much more CPU time than oneSlowRequest because ten method calls and ten rounds of processing input take up more CPU than one.

      Profiling CPU time rather than execution time can tell us how the CPU is utilized and is essential to improving a program's throughput.

      There are other shortcomings to the current execution profile: every sampling period only up to six randomly selected platform threads are sampled to reduce the profiling overhead at the expense of a less accurate profile, and JFR may fail to obtain a sample due to the thread being in a state that prevents a sample from being recorded, yet such failed attempts aren't recorded, which may result in a skewed profile. These shortcomings of the execution profile may be addressed in the future.

      CPU Time Profiling

      The ability to accurately and precisely measure CPU consumption was added to Linux in 2.16.12 through a timer that emits signals at regular intervals of time spent on the CPU (as opposed to elapsed, or "wall-clock" time). Most profilers on Linux utilize this mechanism to produce CPU time profiles.

      Some popular third-party Java tools, including async-profiler, employ Linux's CPU timer to produce CPU profiles for Java programs. However, to do so, such tools interact with the Java runtime through unsupported internals; such interactions may lead to process crashes.

      This Linux capability should be employed by JFR to safely produce CPU time profilers on the JDK. This functionality will allow the wide audience of developers deploying Java applications on Linux to make them more efficient.

      Description

      We add CPU-time profiling to JFR, on Linux systems only. This functionality is currently deemed experimental until sufficient experience in the field with its utility and design allows us to make it a permanent feature.

      JFR obtains a CPU profile by using Linux's CPU-timer mechanism to sample the stack of all threads regularly, every fixed period of CPU time. In addition, it records any failure to obtain stack samples, so the overall CPU portions reported correctly represent the application's CPU activity.

      The new event, jdk.CPUTimeSample is not enabled by default, and rate of sampling can be set by the user through the event's rate property in one of two ways:

      • As a time period — rate=10ms means that a sample event will be produced for each platform thread after it consume 10ms of CPU time, or
      • As an overall rate — rate=500/s means that 500 sample events will be produced every second, evenly spread over all platform threads.

      The default rate is 500/s, and in the profiling setting it is 1000/s. Higher rates will result in a more accurate and precise profile, but will increase the overhead on the program.

      The event comprises the following fields:

      • startTime: time directly before walking the specific stack
      • endTime: time directly after walking the specific stack
      • sampledThread: the thread sampled to obtain the stack trace
      • stacktrace: the obtained stack trace or null if the sampler failed to walk the stack

      This event is similar to the execution-time sampler's jdk.ExecutionSample event, and enabling the new event does not affect the execution sample event in any way, and the two can be collected simultaneously.

      The new event jdk.CPUTimeSampleLoss is emitted when samples are lossed due to internal implementation reasons, such as internal data structures becoming full, and contains the number of dropped samples in the last sampling round (lossedSamples). This event is enabled by default, but is only emitted when jdk.CPUTimeSample is enabled.

      CPU profiling may be added to other platforms in the future.

      Usage

      Enable and configure the new JFR event:

      > jfr configure --input profile.jfc --output /tmp/my_profile.jfc jdk.CPUTimeSample#enabled=true jdk.CPUTimeSample#period=9ms # create new settings
      > jcmd <pid> JFR.start /tmp/my_profile.jfc # use those settings to start a new profiling session for your application

      Or use it directly with -XX:StartFlightRecording:

      java -XX:StartFlightRecording=jdk.CPUTimeExecutionSample#enabled=true,filename=profile.jfr ...

      With the new CPU-time sampler, it becomes clear that the applications spends nearly all its time in tenFastRequests:

      > jfr view cpu-time-hot-methods profile.jfr
      
                                   Java Methods that Execute the Most from CPU time sampler
      
      Method                                                                                              Samples Percent
      <hr />
      
      jdk.jfr.internal.JVM.emitEvent(long, long, long)                                                         88  77.19%
      N/A                                                                                                       7   6.14%
      java.lang.Object.wait0(long)                                                                              2   1.75%
      java.lang.Class.forName0(String, boolean, ClassLoader, Class)                                             2   1.75%
      java.lang.Thread.currentCarrierThread()                                                                   1   0.88%

      CPU profiling may be added to other platforms in the future.

      Profiling in production

      With an appropriately configured rate, CPU profiling may exact an overhead that is low enough to collect CPU time information as the program runs in production. Such so-called "continuous profiling" is gaining popularity.

      As with every JFR event, jdk.CPUTimeSample events may be streamed as they're being recorded, even over the wire to a remote consumer, for online analysis.

      Alternatives

      Rely on external tools

      To allow external tools to employ OS capabilities while examining the Java Virtual Machine, safe and supported native APIs must be added. Such APIs expose the internal machinery of the runtime making it harder to evolve. Furthermore, to make such APIs safe, some inefficiencies may be introduced that can be avoided by an implementation of the functionality in the JDK.

            jbechberger Johannes Bechberger
            jbachorik Jaroslav Bachorík
            Johannes Bechberger Johannes Bechberger
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Created:
              Updated: