Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8322295

Late G1 Barrier Expansion

    XMLWordPrintable

Details

    • JEP
    • Resolution: Unresolved
    • P4
    • None
    • hotspot
    • None
    • Roberto Castañeda Lozano & Erik Österlund
    • Feature
    • Open
    • Implementation
    • M
    • M

    Description

      Summary

      Garbage collectors instrument application memory accesses with additional instructions (called barriers) that register information about the accessed objects. In managed runtime environments with multiple execution modes, such as HotSpot, this instrumentation is performed not only by the interpreter but also by the JIT compiler(s). This JEP proposes simplifying the handling of G1 (HotSpot's default collector) barriers in C2 (HotSpot's optimizing just-in-time compiler), by hiding their internal structure in C2's main code transformations and optimizations. This reduces C2's execution time significantly; precludes subtle, C2-induced barrier-ordering issues by construction; and makes G1 barriers more accessible to non-C2 experts, while preserving the original performance of the JIT-compiled code.

      Goals

      • Reduce the overhead (execution time) of C2.

      • Guarantee that C2 preserves invariants about the relative ordering of memory accesses, safepoints, and barriers.

      • Make G1 barriers accessible to HotSpot developers without a deep understanding of C2.

      • Preserve the current quality of the C2-generated code, in terms of speed and size.

      Non-Goals

      • Improve the quality of the C2-generated code.

      Success Metrics

      • Improved C2 speed on a set of industry-standard Java benchmarks.

      • No significant performance regression on a set of industry-standard Java benchmarks.

      Motivation

      The increasing popularity of cloud-based Java deployments has led to a stronger focus on reducing overall JVM overhead. JIT compilation is an effective technique to speed up Java applications, but incurs a significant overhead in terms of processing time and memory usage. This overhead is particularly noticeable for optimizing JIT compilers such as JDK's C2. Reducing the overhead is key to making Java a better fit for the cloud.

      Another core JVM component that contributes to the JVM's total overhead, in addition to JIT compilers, is the garbage collector. As a semi-concurrent, generational garbage collector (GC), G1 needs to interface with the JIT compilers to instrument application memory accesses with barrier code. In the case of C2, maintaining and evolving this interface requires deep knowledge of C2 internals, which few GC engineers possess. Furthermore, some barrier optimizations require applying low-level transformations and techniques which cannot be expressed in C2's intermediate representation. These obstacles have slowed down or directly blocked the evolution and optimization of key aspects of G1. Decoupling G1 barrier instrumentation from C2 internals would enable GC engineers to further optimize and reduce the overhead of G1, by means of algorithmic improvements and low-level micro-optimizations.

      C2 represents the methods under compilation as a "sea of nodes", a type of program dependence graph that gives great freedom to the compiler to choose where to schedule program instructions in the final code. While this simplifies and increases the scope of many optimizations, it makes it hard to preserve invariants about the relative ordering of instructions. In the case of G1 barriers, which are currently treated as any other operations in the sea of nodes representation, this has resulted in complex bugs such as JDK-8242115 and JDK-8295066, and even as of today it is hard to guarantee the absence of other issues of similar nature.

      Early experiments and manual inspection of the C2-generated code have revealed that the assembly instruction sequences generated by C2 to implement barriers are very similar to the handwritten assembly implementation used by the interpreter to instrument memory accesses. This suggests that the scope for C2 to optimize barrier code is limited, and code of similar quality could be obtained if the barrier implementation details were hidden from C2 and only expanded at the very end of the compilation chain.

      Description

      Early G1 Barrier Expansion

      In the current model, C2 captures all detailed barrier operations together with the original operations of the application in its intermediate representation (IR). C2 expands the detailed barrier operations of each object memory access during bytecode parsing, the initial phase where C2 transforms bytecode into IR operations. Specific logic for G1 and C2 guides this expansion via the Heap Access API. Once the barriers are expanded, C2 transforms and optimizes all operations uniformly through the compilation chain, as depicted in the following diagram where IR<C,P> denotes an intermediate representation that is specific to collector C and platform P:

      Exposing all GC barrier details early in the IR has two potential advantages. First, the same GC barrier implementation (expressed as IR operations) can be reused for all platforms. Second, C2 can optimize and transform the barrier operations in the scope of the entire compilation unit, which can potentially improve the code quality. However, there is limited practical benefit because

      1. platform-specific G1 barrier implementations have to be provided nevertheless for other execution modes such as bytecode interpretation, and
      2. G1 barrier code operations are not very amenable to optimization, due to control-flow density, memory-induced serialization, and other factors.

      The early expansion model has three tangible and significant disadvantages, already discussed in the Motivation section: opacity to GC engineers, difficulty to guarantee absence of barrier ordering issues, and substantial C2 compilation overhead.

      Late G1 Barrier Expansion

      This JEP proposes expanding G1 barriers as late as possible in the compilation chain of C2, as outlined in the following diagram (using the same notation as above):

      The key idea is to represent memory accesses in C2's IR only by their core operations, and postpone barrier instrumentation to code emission, where IR operations are translated into actual machine code. Barrier implementations are provided in (pseudo) assembly code, which is a familiar level of abstraction for all HotSpot engineers. Crucially, these assembly implementations are already available for all platforms to support bytecode interpretation, and can simply be reused.

      Late barrier expansion can be implemented as follows. IR memory accesses generated during bytecode parsing are tagged with the information required to generate their barrier code at code emission. This information is not exposed to C2's analysis and optimization mechanisms. Instruction selection replaces abstract memory access operations with platform- and GC-specific instructions where barriers are still implicit. GC-specific instructions are required to accommodate late barrier expansion, for example making sure that the register allocator reserves enough temporary registers to perform the barrier operations. Finally, during code emission, each GC-specific memory access instruction is transformed into actual assembly code according to its tagged barrier information. The assembly code consists of the platform-specific memory instruction itself surrounded with barrier assembly code. The barrier code is generated using the existing interpreter implementation, augmented with routines ("assembly stubs") implementing potential calls from the barrier into the JVM runtime.

      ZGC, an alternative fully-concurrent collector available in the JDK, already applies the late barrier expansion model. Many of the mechanisms developed for ZGC, such as the logic to perform runtime calls in the barrier code, can be abstracted and reused for G1.

      G1 Barrier Background

      This subsection summarizes the function and structure of the G1 barriers, to make the rest of the Description section more approachable. G1 is a semi-concurrent, generational collector. To support both concurrency and multiple generations, G1 instruments each write of object y into field f of objectx (represented as x.f = y) with barrier code. More specifically, G1 surrounds a write with pre- and post-barriers, where the pre-barrier's main responsibility is supporting concurrent marking and the post-barrier's main responsibility is supporting the partition of heap regions into generations. To this aim, the pre-barrier communicates to the collector the value of x.f prior to the write, and the post-barrier communicates to the collector whether x and y are stored in different heap regions. The main structure of each sub-barrier is similar: in the worst case, both of them result in a costly runtime call into HotSpot to inform the collector of the relevant changes made by the write operation. To mitigate this cost, both sub-barriers test a number of conditions under which the runtime call can be avoided. For example, the pre-barrier tests whether concurrent marking is actually in progress: if not, there is no need to communicate the old x.f value to the collector. The post-barrier tests whether x and y lie in the same heap region, in which case there is no need to notify the collector. At a high level, a write is instrumented as follows:

      if concurrent marking is in progress, and x.f is not null, and ...:
        inform G1 about the value of x.f
      x.f <- y
      if x and y are stored in different regions, and y is not null, and ...:
        inform G1 that x and y are stored in different regions

      Atomic memory operations (those resulting from compareAndSet, compareAndExchange, getAndSet and their multiple variants in java.lang.invoke.VarHandle) also need to be extended with barriers, similarly to writes. Finally, reads derived from java.lang.ref.Reference::get() are also instrumented with pre-barriers.

      Optimizations

      To match the quality of the code generated by C2 under the early barrier expansion model, this JEP proposes a number of optimizations on top of the late barrier expansion model. The optimizations focus on write barriers, as our preliminary experiments show that these constitute around 99% of all executed G1 barriers.

      Combination of Compression and Write Operations

      HotSpot can be configured to compress object pointers (OOPs) in memory, to reduce the size of the heap. In this mode, OOPs have to be compressed/decompressed right before/after they are written to/read from the heap, whereas the G1 barrier logic expects them to always be uncompressed. Currently, C2 handles this mismatch by modeling compression and decompression explicitly as independent IR operations, and feeding the uncompressed OOP value to the barrier logic. This leads to barrier-instrumented writes as follows:

      if concurrent marking is in progress, and ...:
        inform G1 about the value of x.f
      y' <- compress(y)
      x.f <- y'
      if x and y are stored in different regions, and ...:
        inform G1 that x and y are stored in different regions

      In contrast, a (naive) late barrier expansion model generates redundant decompression operations, because it does not make the uncompressed OOP value (y) available to the logic that expands memory access operations:

      // Code emitted from compression instruction:
      y' <- compress(y)
      
      // Code emitted from write instruction (takes only y' as input):
      if concurrent marking is in progress, and ...:
        inform G1 about the value of x.f
      x.f <- y'
      y'' <- decompress(y') // y is not available here, hence it has to be recomputed as y''
      if x and y'' are stored in different regions, and ...:
        inform G1 that x and y'' are stored in different regions

      To achieve the first instruction sequence within the late barrier expansion model and save a decompress operation per write, one can define a "compress-and-write" pseudo-instruction that matches pairs of compress and write IR operations. This pseudo-instruction receives the uncompressed OOP (y) as input, and can thus be expanded during code emission to the same instruction sequence as in the early expansion model. This optimization can be implemented in a few tens of Architecture Description Language lines and leverages existing C2 platform-independent optimizations such as common subexpression elimination and algebraic simplifications.

      Barrier Simplification Based on Nullness Information

      C2 can often guarantee that the OOP to be stored in a memory write is either necessarily null or necessarily non-null. This can follow trivially from the source program or be inferred by C2's type analysis. Code emission can exploit this knowledge to simplify (or even remove altogether) post-barriers, and to simplify OOP compression and decompression when this mode is enabled.

      The early barrier expansion model applies such simplifications seamlessly by leveraging C2's generic, platform-independent analysis and optimization mechanisms. In a late barrier expansion model, the same simplifications can be achieved by explicitly skipping the emission of unnecessary barrier or object pointer compression/decompression instructions according to the information provided by C2's type system. Preliminary experiments show that around 60% of the executed write post-barriers can be simplified or removed by this technique.

      Alternatives

      The C2 compilation chain contains a few natural locations at which GC barriers may be expanded:

      • At bytecode parsing (early barrier expansion): GC barriers are expanded in the initial IR construction.

      • At macro expansion: GC barriers are expanded at the end of all platform-independent optimizations (loop transformations, escape analysis, etc.).

      • After code motion: GC barriers are expanded after platform-specific instructions are selected and scheduled, but before platform-specific registers are allocated. Today, there is no support for expansion at this level.

      • After register allocation: GC barriers are expanded between register allocation and the last C2 transformations.

      • At code emission (late barrier expansion): GC barriers are expanded when C2 instructions are translated into actual machine code.

      Each of the above locations offers a different trade-off point in terms of C2 overhead, required C2 knowledge, risk of suffering instruction scheduling issues, and required platform-specific effort. Generally, the later barriers are expanded the lower the C2 overhead is. The largest savings are observed when barrier expansion is moved from bytecode parsing to macro expansion, and then from code motion to after register allocation. All expansion locations but code emission require significant C2 knowledge, but only bytecode parsing and macro expansion require familiarity with C2's platform-independent IR. Expanding barriers at these two locations does not require platform-specific support, but on the other hand risks triggering instruction scheduling issues during C2's code motion phase. Expanding barriers at code emission (as proposed here) is the alternative with lowest overhead and the only one that does not require C2-specific knowledge. As for all other platform-dependent phases, it shares the advantage of precluding instruction scheduling issues and the disadvantage of requiring implementation effort for each platform.

      The following table summarizes the advantages and disadvantages of each phase:

      phase C2 overhead requires C2 knowledge control over scheduling platform-independent
      at bytecode parsing (current) high yes no yes
      at macro expansion medium yes no yes
      after code motion medium yes yes no
      after register allocation low yes yes no
      at code emission (this JEP) low no yes no

      An alternative dimension that could be explored is the granularity of the barrier implementation exposed to C2. However, experiments conducted for ZGC concluded that even exposing barriers as single operations in C2's intermediate representation risks causing instruction scheduling issues.

      Another alternative to achieve lower C2 overhead is to simply choose, if allowed by the application requirements, a GC using more lightweight barriers (Serial GC or Parallel GC) and/or a late barrier expansion model (ZGC).

      Testing

      To mitigate the risk of introducing functional failures, we plan to combine:

      • regular testing based on the extensive, already-available JDK test suite executed under different configurations by Oracle's internal test system,

      • new tests added during the implementation of this JEP to exercise cases that are not sufficiently covered today, and

      • compiler and GC stress testing to exercise rare code paths and conditions that may be otherwise missed.

      To mitigate the risk of performance regressions, we plan to evaluate the JEP implementation using a set of industry-standard Java benchmarks on different platforms.

      To measure and compare compilation speed and code size, we plan to use the functionality provided by the -XX:+CITime HotSpot option. To control the variation in size and scope of the compiled methods across JVM runs, we plan to measure multiple iterations of each benchmark run, and use JVM options such as -Xbatch that make each JVM run more deterministic.

      Risks and Assumptions

      The viability of the late C2 barrier expansion design has been already proven by ZGC, which has been based on this model since JDK 13. However, as with every change affecting multiple core HotSpot components (in this case, the interface between HotSpot's default garbage collector and optimizing JIT compiler), there is a non-negligible risk of introducing bugs in the implementation that may cause failures and performance regressions. To mitigate this risk, we plan to conduct internal code reviews and extensive testing and benchmarking beyond the usual activities conducted for regular enhancements and bug fixes, as outlined in the "Testing" section above.

      Imprecise card marking is a write barrier optimization that, in the current G1 implementation, avoids subsequent runtime calls for writes into different fields of the same object. Based on preliminary experimental results, this JEP assumes that imprecise card marking is not required to match the quality of the code generated by the early barrier expansion model. We plan to revisit this assumption during the implementation phase, when a more complete prototype will be available for experimentation.

      Attachments

        Activity

          People

            rcastanedalo Roberto Castaneda Lozano
            rcastanedalo Roberto Castaneda Lozano
            Roberto Castaneda Lozano Roberto Castaneda Lozano
            Erik Ă–sterlund
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated: