JDK-8350338

Cooperative JFR Sampling

    • Markus Grönlund
    • Feature
    • Open
    • jfr
    • Implementation
    • M
    • M

      Summary

      Improve stability for JDK Flight Recorder (JFR) asynchronous sampling of Java code execution (interpreted and compiled) by only walking the stack at well-defined code locations, i.e., safe point instructions, together with a solution to the inherent safepoint bias problem.

      Goals

      • Increase system stability by avoiding guessing sender frames from arbitrary code positions, resulting in a safer interplay between sampling and modern concurrent GCs, such as ZGC.

      Non-Goals

      • Solve every issue related to safepoint bias, because that would require a much greater effort and platform-specific support code routines, including restructuring frame layouts and describing them with metadata.
      • JFR asynchronous sampling only samples Java code execution (interpreted and compiled), so profiling native (C/C++) code in the HotSpot VM or other native libraries is not a goal.

      Success Metrics

      • No crashes when JFR is running with concurrent GCs, e.g., ZGC.

      Description

      This JEP describes a new approach to sampling in the HotSpot JVM, primarily targeting the JDK Flight Recorder (JFR) system. We call this new approach Cooperative JFR sampling.

      It combines the safety properties of performing stack walking only at well-defined code locations with a solution to the safe point bias problem by letting a pair of threads cooperate to produce a sample representing program execution. First, we describe some of the current sampling system's workings, challenges, and drawbacks to provide motivation and background. Then, we detail the new approach's design and high-level implementation details and discuss how it resolves existing problems and offers new capabilities.

      Definitions and Terms

      • Sampling is an asynchronous operation involving an out-of-band interrupt of a thread executing a synchronous code stream. A sample is taken at the code position where the thread was interrupted. Sampling is a statistical inductive technique for understanding program execution.
      • Current or classical sampling refers to the existing JFR sampling system, in which a single sampler thread performs all work while the sampled thread remains suspended.
      • The new approach detailed in this document is called Cooperative JFR sampling. It's a collaborative technique where threads work together to produce an event, a sample representing an asynchronous interrupt. The term cooperative emphasizes collaboration between threads and contrasts it with the current sampling system, which is non-cooperative.
      • A safepoint is a well-defined and well-described position in the code stream where the JVM makes it safe for the thread to pause execution. A code position classified as a safepoint implies the thread stack is safe for walking at that position. Safe for walking means that all frames on the stack are consistent and the metadata correctly describes the frame layout.
      • Unparsable means that a piece of code does not have enough metadata to determine its stack frame size, making it impossible to traverse the stack frames associated with that code. Code classified in this category relates to hand-rolled assembly support routines, such as stubs and blobs.
      • Accuracy means how precisely a sampling system identifies the code position at which a thread was interrupted.
      • Crash protection is a safety mechanism JFR uses to prevent system crashes during sampling due to the risks of walking stacks from arbitrary code positions. It is implemented in OS platform-specific ways using Structured Exception Handling (SEH) and setjmp/longjmp constructs.
      • Leaf frame means the current stack frame of an interrupted thread.
      • Last Java Frame (ljf) means a thread's last, or current, Java frame when transitioning to a non-Java execution state. It is saved both for inspection by garbage collectors and for when the thread returns to execute Java code.
      • SafepointBlob is a support routine that handles branching and calling into the JVM to process safe point requests. The routine creates a stub frame on the stack and sets the Last Java Frame (ljf).
      • current.real_fp() denotes the frame pointer (fp) of a frame. For compiled frames, it is equivalent to the sender (caller) frame's stack pointer (sp).

      Motivation

      JVMs typically only describe a small subset of all code instructions as safe for execution pause and inspection. These instructions are termed safepoints and play a central role in bringing the JVM to a secure state. A secure state traditionally means a well-known and well-described global JVM state in which automatic memory management can occur. By design and tradition, these code positions are the only places where stack walking is guaranteed to work, and many system invariants only hold for these locations. A sampling system mainly concerned with safety only allows stack walking at safe point code positions. The problem with this approach is termed the safe point bias problem, an inherent problem where a system must trade accuracy and representativeness for safety. A system with severe safe point bias no longer describes hot sections of an executing program but merely reflects the code positions that contain safe point instructions. For a sampling system to better represent an executing program, it has historically been necessary to employ techniques to solve the safe point bias problem in one way or another, including walking stacks at non-safepoint code positions.

      When porting JFR from the JRockit JVM to the HotSpot JVM, one of the main challenges was implementing the ExecutionSample event. This event represents a code position together with a stack trace, providing users with an indication of what Java code is hot, that is, used the most, indicating where to focus troubleshooting and optimization work. While the HotSpot JVM had an asynchronous profiling API, AsyncGetCallTrace, it is unsupported. Another drawback is that it builds on POSIX SIGPROF, which limits its universal applicability because non-POSIX platforms, most notably Windows, lack this capability. Because of these drawbacks, the current JFR sampling system takes a different approach: It features a single sampler thread and uses platform-specific suspend/resume mechanisms. In contrast to AsyncGetCallTrace, a dedicated sampler thread performs all sampling work while the sampled thread remains suspended.

      The HotSpot JVM was never designed to walk stacks from arbitrary code positions. For example, a compiled frame representing a compiled method describes its stack frame size as a numerical constant. But this constant is only valid for a subset of that method's instructions, i.e., its safe point code positions. Other instructions frequently make use of push/pops at will. Yet another problematic situation is when the JVM executes code in the interpreter. Interpreter frames are defined and well-described using a schema anchored to the frame pointer register. However, numerous instances exist where interpreter code creates frames that do not adhere to this schema. These problems arise when inspecting code in the JVM for which some metadata description exists.

      An even more challenging problem from a sampling perspective is that the HotSpot JVM depends on various support code routines, termed stubs and blobs. These are hand-rolled, platform-specific assembly code routines with no well-defined schema or structure, and they lack metadata for describing their stack frames. This code often executes in leaf frames, and an interrupt that lands in such a stub or blob is unparsable by design. These sample points are lost because they cannot be related to program execution. The problem described for code stubs and blobs is exacerbated because some heavily used, essential Java methods are implemented in the JVM as intrinsics, which are internal, optimized method representations. Most intrinsics are implemented using assembly code stubs and blobs and lack metadata. Consequently, heavily used JVM intrinsics are invisible to a sampling system.

      Sampling can interrupt code at any instruction. But as already mentioned, the JVM only guarantees stack walking to be safe for a specific subset of all instructions. All Java frames on the stack except the interrupted leaf frame are safe because they all represent a call site, which is always a safepoint instruction. Hence, the problem can be reduced to deriving the caller frame for the interrupted leaf frame. The challenges are that the leaf frame may be in an inconsistent state or the interrupted instruction is not a safepoint instruction. If the leaf frame represents a compiled or interpreted method (determined by the program counter) that includes metadata, it is possible to guess where the caller frame is located. This guessing is reified in the HotSpot JVM as a battery of safety checks grouped under the term safe for sender. It attempts to determine if moving to the sender (caller) frame from the current position is safe. No guessing is possible if the code associated with the frame lacks metadata.

      Although helpful, it was evident early on that these safety checks were insufficient for JFR. The state combinations are endless, changing and evolving with new technologies. Since JFR is designed and promoted as a technology for continuous use in production systems, safety and stability are top priorities. The high risk involved in attempting to walk a stack from an arbitrary code location, even should a set of relatively extensive safety checks pass, led us to introduce the JFR crash protection mechanism to avoid system crashes caused by faulty sampling attempts.

      These safety checks also have other drawbacks. Performance decreases as the number of checks increases because the sampled thread remains suspended while they run. They also err on the side of safety because they aim to avoid crashes. This means perfectly valid code positions are sometimes skipped because the risk is too high. For example, the frame prologue of a compiled method must be considered unparsable if the program counter (pc) is below the instruction described by its frame complete offset, which is part of the method's metadata.
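
      The frame complete rule in the example above can be illustrated with a minimal Python sketch. The CompiledMethod class, its field names, and the addresses are hypothetical stand-ins chosen for illustration; they are not HotSpot types, and the sketch models only the prologue check, not the full battery of safety checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledMethod:
    # Hypothetical model of the metadata attached to a compiled method.
    code_begin: int              # address of the first instruction
    frame_complete_offset: int   # offset at which the frame is fully set up


def classical_sampler_can_walk(method: CompiledMethod, pc: int) -> bool:
    """True if pc is at or past the frame-complete instruction."""
    return pc >= method.code_begin + method.frame_complete_offset


method = CompiledMethod(code_begin=0x4000, frame_complete_offset=0x10)
print(classical_sampler_can_walk(method, 0x4008))  # False: pc is in the prologue
print(classical_sampler_can_walk(method, 0x4010))  # True: frame is complete
```

      A sample that lands at 0x4008 in this model is a perfectly valid code position, yet a sampler that walks from the interrupted position has to discard it.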

      Currently, safety and stability are ensured by the JFR crash protection mechanism. However, with modern concurrent GCs, we may have hit the end of the road for how much the crash protection system can keep us from trouble. There are now more subtle problems to consider. One class of problems relates to concurrent class unloading. Even when stack walking is wrapped inside crash protection, it is still possible to misread data being unloaded. This was not an issue previously when class unloading only happened at well-defined moments, non-concurrently with JFR sampling. Crash protection will not help in these situations because an object can be safely read (under protection) but unloaded when accessed later outside protection, a result of a misread caused by wrongly guessing the location of a sender frame.

      Finally, there are some additional drawbacks of the current sampling system:

      1. Having a single thread do all the work related to stack walking does not scale with more cores and threads.
      2. It is restricted because it is impossible to determine what resources a suspended thread has acquired upon suspension. The canonical example concerns the malloc lock. The sampler thread cannot use any dynamic memory because the suspended thread could be the owner of this lock, resulting in a deadlock. This restricts the number of stack frames captured to a predefined global constant because structures must be pre-allocated.

      Design

      Many new technologies have been introduced in the HotSpot JVM in recent years. One is per-thread controlled safepointing. In contrast to traditional safepoint mechanisms, the poll page of each thread can be armed and disarmed individually. This gives per-thread granularity for safepoint operations, providing a foundational component used by a general framework termed thread-local handshake. Reading or checking a thread's poll page, 'polling,' is what we mean by the safepoint check or poll instruction. It is done at well-known places in the JVM, most notably at the end of compiled methods (poll return) and at loop headers. The poll page is also read at well-defined transition points, such as when the thread transitions from the JVM to Java code. If the poll page is armed, the thread branches from its normal execution and continues in a safepoint handler routine.

      Because poll pages can be controlled on a per-thread basis, we can quickly build a very safe, although heavily biased, sampling system simply by arming the poll pages for individual threads. A significant benefit of this approach is that stack walking becomes safe by design because the thread walks its stack only at the traditional, well-defined safepoint code positions, where the stack is guaranteed to be consistent. Another advantage of this design is that it removes the need to perform extensive, albeit insufficient, safety checks. The major drawback is that this is the very definition of the safepoint bias problem. The resulting system would be safe but unrepresentative.

      What if we could use this safe mechanism but with high sampling accuracy?

      This entails finding a solution to the safe point bias problem, premised on the assumption that walking will only be performed when the stack is consistent. Our solution to this problem combines aspects of the current JFR sampling system. Instead of letting the sampler thread perform all the required work while the sampled thread remains suspended, we can let the two threads cooperate to accomplish a common task as a pair. We let the JFR sampler thread continue to suspend a thread, just like before. But instead of trying to guess how to perform a full stack trace from an interrupted arbitrary code position, which is inherently unsafe and time-consuming, it only takes a snapshot of the program counter (pc) and the stack pointer (sp) for the top Java frame. This snapshot is inserted into a thread-local queue associated with the sampled thread. Before letting the sampled thread resume execution, its safe point poll page is armed.

      On resuming execution, the sampled thread executes its regular code stream until it hits the next safe point poll instruction, which can be in one of several places:

      • In a loop header
      • On method return
      • On thread transition

      Because the poll page is armed, the thread branches into a safe point poll page handler routine. We will extend this routine to check for enqueued JFR sample requests. If any, the thread reconstructs a stack trace representing the sampled state from the snapshot provided.

      This is a central point: in collaborating to complete this task, the sampler thread only describes the sampled code position. Thus, we avoid the unsafe operations involved in guessing a caller frame. The sampled thread later reconstructs the sampled state at a safe and well-defined position, where the stack is guaranteed to be consistent.
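
      The division of labor described in this section can be modeled in a few lines of Python. This is a simplified sketch and not HotSpot code: SampledThread, SampleRequest, sampler_take_sample, and safepoint_poll are hypothetical names, suspension and resumption are implied rather than modeled, and the "stack walk" is reduced to recording the sampled pc.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SampleRequest:
    pc: int      # program counter captured at the interrupt
    sp: int      # stack pointer of the top Java frame
    ticks: int   # timestamp of the interrupt


@dataclass
class SampledThread:
    # Hypothetical stand-in for a Java thread; not a HotSpot type.
    name: str
    poll_armed: bool = False
    requests: deque = field(default_factory=deque)
    events: list = field(default_factory=list)

    def safepoint_poll(self):
        """Models a poll at a loop header, method return, or thread transition."""
        if not self.poll_armed:
            return  # disarmed poll page: continue normal execution
        self.poll_armed = False
        # The stack is consistent here, so the thread itself turns each
        # pending request into an event.
        while self.requests:
            request = self.requests.popleft()
            self.events.append(("ExecutionSample", request.pc, request.ticks))


def sampler_take_sample(thread, pc, sp, ticks):
    """Work done by the sampler thread while `thread` is suspended."""
    # Only the code position is recorded; no stack walk is attempted.
    thread.requests.append(SampleRequest(pc, sp, ticks))
    thread.poll_armed = True  # arm the poll page before resuming the thread


worker = SampledThread("worker")
worker.safepoint_poll()            # nothing pending, so no event is produced
sampler_take_sample(worker, pc=0x1000, sp=0x7F00, ticks=42)
worker.safepoint_poll()            # armed: the thread processes its own request
print(worker.events)               # [('ExecutionSample', 4096, 42)]
```

      The sampler's part is a constant amount of work per sample, while the cost of producing the stack trace is paid by the sampled thread at its next poll.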

      High-level implementation details

      We describe the sampled execution state using a quadruple, termed 'JFR sample request':

      • (pc, sp, bcp, ticks) ⇿ JFR sample request

      When the sample request represents a compiled method and frame, i.e., JIT code, it denotes the following:

      • pc = instruction pointer inside nmethod
      • sp = stack pointer of compiled frame
      • bcp = null (only for interpreter frames)
      • ticks = time of sample

      When the sample request represents an interpreted method and frame, it denotes the following:

      • pc = Method*
      • sp = interpreter frame pointer (fp)
      • bcp = bytecode instruction pointer
      • ticks = time of sample

      With this description, the thread can reconstruct the sampled state inside the safe point poll page handler routine, provided that the safe point system works correctly, i.e., poll instructions are placed correctly. The proper placement of poll instructions becomes the sampling infrastructure, where sampling accuracy becomes a function of how well the system compensates for safe point bias. An invariant of this infrastructure is that no sampled frame may escape to any of its callers by either regular or exceptional unwind. The reconstructed frame is the top frame submitted to JFR for registering a stack trace.
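
      The two readings of the request quadruple can be written down as a small Python model. The class and function names are hypothetical, and treating a non-null bcp as the marker for an interpreter frame is an assumption of this sketch rather than a statement about the implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JfrSampleRequest:
    pc: int             # compiled: instruction pointer in the nmethod; interpreted: Method*
    sp: int             # compiled: stack pointer; interpreted: frame pointer (fp)
    bcp: Optional[int]  # bytecode instruction pointer; None for compiled frames
    ticks: int          # time of sample

    @property
    def is_interpreted(self) -> bool:
        # Assumption of this sketch: only interpreter frames carry a bcp.
        return self.bcp is not None


def compiled_request(pc: int, sp: int, ticks: int) -> JfrSampleRequest:
    return JfrSampleRequest(pc=pc, sp=sp, bcp=None, ticks=ticks)


def interpreted_request(method: int, fp: int, bcp: int, ticks: int) -> JfrSampleRequest:
    # The pc slot carries the Method* and the sp slot carries the fp.
    return JfrSampleRequest(pc=method, sp=fp, bcp=bcp, ticks=ticks)
```

      Reusing the same four slots for both frame kinds keeps the request a fixed-size value that the sampler thread can fill in without allocating memory.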

      Let's look at an example where a sampled thread continues execution after being sampled. In this example, the sampled thread calls four additional methods before hitting the next safe point poll check at method return.

      On evaluating the poll check, the thread branches into a stub routine called SafepointBlob. This routine constructs a large SafepointBlob stack frame to save the CPU context, i.e., the registers, on the stack. It then sets the Last Java Frame (ljf) for the thread and calls into the JVM.

      In the JVM, inside the safepoint poll handler, the thread creates an iterator and walks its stack, starting from its Last Java Frame (ljf), skipping frames below the stack pointer (sp) described in the JFR sample request.

      Once the sampled frame is found, a new frame is constructed, and its program counter (pc) is set to the pc recorded in the JFR sample request, which corrects for safepoint bias. The newly created and adjusted frame becomes the top frame submitted to JFR for registering a stack trace.

      The sampled stack trace reconstruction algorithm:

      1. Construct a StackFrameStream iterator from the Last Java Frame (ljf).
      2. While (current_frame.real_fp() <= request.sp), step to the next frame.
      3. Construct a new frame from the located frame and adjust its pc to the pc described in the JFR sample request (safepoint bias correction).
      4. Record a stack trace with JFR using the safepoint-bias-adjusted frame as the top frame.
      5. Write an ExecutionSample event to represent the recorded stack trace.
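
      Steps 1 to 4 of the algorithm can be sketched in Python for compiled frames on a downward-growing stack. Frame, SampleRequest, and reconstruct_stack_trace are hypothetical names, the addresses are made up, and a plain list ordered from the Last Java Frame toward the callers stands in for the StackFrameStream iterator. Step 5, writing the event, is left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    method: str
    sp: int        # stack pointer of this frame
    real_fp: int   # for compiled frames: the sender's sp
    pc: int        # pc at which the frame is currently stopped


@dataclass(frozen=True)
class SampleRequest:
    pc: int
    sp: int
    ticks: int


def reconstruct_stack_trace(frames_from_ljf, request):
    """frames_from_ljf: frames ordered from the Last Java Frame toward callers."""
    it = iter(frames_from_ljf)                    # step 1: start at the ljf
    current = next(it, None)
    while current is not None and current.real_fp <= request.sp:
        current = next(it, None)                  # step 2: skip younger frames
    if current is None:
        return None                               # sampled frame not on the stack
    top = replace(current, pc=request.pc)         # step 3: bias correction
    return [top, *it]                             # step 4: top frame plus callers


stack = [  # youngest frame (the ljf) first
    Frame("d", sp=840, real_fp=880, pc=9004),
    Frame("c", sp=880, real_fp=920, pc=8004),
    Frame("b", sp=920, real_fp=960, pc=7004),
    Frame("a", sp=960, real_fp=1000, pc=6004),
    Frame("sampled", sp=1000, real_fp=1040, pc=5020),
    Frame("caller", sp=1040, real_fp=1100, pc=4004),
]
request = SampleRequest(pc=5008, sp=1000, ticks=42)
trace = reconstruct_stack_trace(stack, request)
print([(f.method, f.pc) for f in trace])
# [('sampled', 5008), ('caller', 4004)]
```

      The four frames pushed after the sample was taken are skipped, and the top frame reports the pc from the request (5008) rather than the pc at which the frame was stopped when the poll was reached (5020).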

      Problems and challenges

      The new approach is a significantly better sampling system in many respects. It would be great if things were arranged so neatly that it could just be applied everywhere. Unfortunately, the real world is a more complicated place. One problem the new approach does not solve concerns sampling when the thread is not executing Java code, as is the case when the thread is executing platform-specific code located in a third-party library. The thread is already at a safe location with a Last Java Frame (ljf); the code position is not the problem in this context. The problem is that we can never know when, if at all, the thread will perform a safe point poll check because this happens only when the thread attempts to reenter the JVM, which in theory could amount to never. This is important because sample requests could be pending for a long time or, worse, never processed. In JFR, we want events delivered promptly, so we need an alternative mechanism for these situations to ensure timely sample delivery. We suggest continuing with the same solution the current sampling system uses by letting the sampler thread perform the stack walk and write events for these thread states. A benefit of the existing solution is that the sampled thread running native code does not need to be suspended to be sampled. Scaling out the number of sampler threads can address potential scalability concerns arising in the future.

      Special handling for interpreter

      To solve the safepoint bias problem, the interpreter needs special handling. Instead of performing a safepoint poll check when dispatching each bytecode, which can cause overhead, an alternative solution is to insert a method return safepoint poll check, as is done for compiled methods. We can accomplish this cleanly by leveraging the safepoint poll instruction located in the remove activation routine of the interpreter code. The necessary modification involves pre-emptively moving the frame pointer (fp) to the caller (sender) frame before issuing the safepoint poll check. Because the template interpreter code is platform-specific, this modification must be made for all supported architectures. The new approach also needs additional safepoint poll instructions for deoptimization and On-Stack Replacement (OSR). Special handling is required when processing a JFR sample request representing an interpreter frame. Here, the request is the top frame, and the frame to reconstruct is the caller (sender) of the top frame.

      Benefits

      1. In the older system, sampled instructions inside a compiled method's prologue were considered unparsable because the frame setup is non-atomic. It was impossible to determine how many words were on the stack until the program counter was greater than or equal to the frame complete offset, described by metadata. With the new approach, all instructions inside compiled methods, including the prologue and epilogue, are accounted for. This is because frames are always complete when reconstructed.

      2. The time the sampled thread remains suspended becomes comparatively short. Previously, the sampler thread had to set up crash protection and execute a relatively extensive battery of safety checks, which, even if they passed, are insufficient to answer questions about stability and safety. We can now skip crash protection and safety checks altogether because stack walking happens at well-defined locations. Suspension time becomes a function of how long it takes for the sampler thread to construct a JFR sample request representing the top Java frame.

      3. Several restrictions are removed. Previously, the sampler thread had to create a full stack trace and send an event under restricted conditions. It could not use dynamically allocated memory because the sampled thread remained suspended. These restrictions dissolve because the sampler thread does not need to allocate memory or take any locks to describe a JFR sample request. JFR can now fully support per-event type configurable stack depths, including for the Execution Sample event type.

      4. The new approach implies several improvements to scalability. The work performed in the critical path of the sampler thread is minimized, with the heavy lifting shifted onto the sampled threads. This is important because reducing the work performed by the sampler thread makes it feasible to replace it with other mechanisms, such as hardware events. Minimal work that does not require critical resources lends itself well to being performed inside a signal handler.

      5. Another benefit is that the sampled thread writes its own Execution Sample event. This fits well with future JFR feature work that involves transactional contexts. One solution for implementing contexts involves relating a contextual event to other events produced by the thread, including Execution Sample events.

      6. A feature that has been in the conceptual stage for a long time almost implements itself with the new approach. This feature measures how long individual threads take to reach their next safe point poll instruction, providing insight into an essential aspect of safe point latency. Visibility and measurements in this area are crucial when designing and implementing JVM policies about where poll instructions are best located. We get this feature almost for free, and to represent it, we introduce a new event, SafepointLatency. The ticks field in the JFR Sample Request describes the time the sampler thread interrupted the thread. Another timestamp happens when the thread enters the safe point poll page handler routine. This duration becomes an approximate measurement of the time it took for the thread to reach its next poll instruction from a specific program counter. A benefit is that stack traces are identical between the Execution Sample and the SafepointLatency events, meaning they can share the same stack trace identifier. The latency event can be throttled and filtered (using the threshold setting) like any other event. The SafepointLatency event is designed primarily for HotSpot VM developers and is disabled by default.
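
      The latency measurement in item 6 amounts to subtracting two timestamps. The Python sketch below illustrates it under stated assumptions: SafepointLatencyEvent as a class and maybe_latency_event are hypothetical names, the tick values are made up, and the threshold is assumed to keep events whose duration is at least the configured value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SafepointLatencyEvent:
    start_ticks: int      # ticks field of the JFR sample request
    end_ticks: int        # timestamp taken on entering the poll handler
    stack_trace_id: int   # shared with the matching Execution Sample event

    @property
    def duration(self) -> int:
        return self.end_ticks - self.start_ticks


def maybe_latency_event(request_ticks: int, handler_entry_ticks: int,
                        stack_trace_id: int,
                        threshold: int) -> Optional[SafepointLatencyEvent]:
    """Return an event only if the latency reaches the threshold."""
    event = SafepointLatencyEvent(request_ticks, handler_entry_ticks, stack_trace_id)
    return event if event.duration >= threshold else None


slow = maybe_latency_event(100, 160, stack_trace_id=7, threshold=50)
fast = maybe_latency_event(100, 120, stack_trace_id=7, threshold=50)
print(slow.duration)  # 60
print(fast)           # None
```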

      Alternatives

      No practical alternatives exist that will allow us to remove JFR crash protection and the battery of safety checks without the risk of accidentally reading a Method* while it is unloading.

      Testing

      Since this is an implementation change, existing tests (unit, integration, stress) will be sufficient to verify equivalent functionality.

      Although relatively small, there are platform-specific (OS/CPU) changes, mainly targeting the TemplateInterpreter, so architectural porting work and testing will be necessary.

      Risks and Assumptions

      A basic assumption is that the safepoint polling system works correctly, i.e., poll instructions are placed correctly. The proper placement of poll instructions becomes the sampling infrastructure. An invariant of this infrastructure is that no sampled frame may escape to any of its callers by either regular or exceptional unwind. The safepoint polling system may need to be extended with additional safepoint poll instructions.

      If the system cannot parse a leaf frame, e.g., an unparsable code blob or code stub, because of a lack of structure and metadata, it falls back to a biased sample (using the Last Java Frame (ljf) with no pc adjustment). Therefore, some safepoint bias will remain in the first release, with a plan to address these shortcomings systematically and iteratively in the future.

      The project aims to increase platform stability, but like any new system, it could have bugs. Removing the existing JFR crash protection construct could add risk to the system.

      Since an asynchronous sampling system can interrupt any instruction, only extensive stress tests can weed out corner cases.

        Attachments: Figure1.png (56 kB), Figure2.png (79 kB), Figure3.png (82 kB)

            Erik Gahlin, Vladimir Kozlov