JDK-8326035

Ahead-of-Time GC Agnostic Object Archiving

    • Type: JEP (Feature)
    • Status: Open
    • Resolution: Unresolved
    • Priority: P4
    • Component: hotspot / gc
    • Scope: Implementation
    • Author: Erik Österlund
    • Discussion: hotspot dash dev at openjdk dot org
    • Effort: M
    • Duration: M
      Summary

      An Ahead-of-Time (AOT) object archiving mechanism, agnostic to which Garbage Collector (GC) is selected at deployment time.

      Goals

      The AOT cache delivered by JEP 483: Ahead-of-Time Class Loading & Linking embeds state computed ahead of time, in order to start the JVM faster. This cache contains an object archive as well as other program state. Currently, the Z Garbage Collector (ZGC) does not support the object archiving mechanism of the AOT cache, so ZGC is not fully supported. This JEP aims to address that. The primary goals of this JEP are:

      • Support object archiving for ZGC (and indeed any other GC)
      • A unified object archiving format and loader

      Secondary goals:

      • Keep GC implementation details and policies separate from the object archiving mechanism

      Non-Goals

      It is not a goal at this time to:

      • Remove the existing GC-dependent object archiving mechanism
      • Unify AOT cache artifacts produced for -XX:+UseCompressedOops with -XX:-UseCompressedOops

      While removing the existing GC-dependent object archiving mechanism of the AOT cache would allow detangling implementation details of other GCs from object archiving, we will not consider that at this time as there is not enough data to make such a decision yet.

      Success Metrics

      It should not take significantly longer for the JVM to start with the new GC-agnostic archived object loader than with the existing GC-specific archived object loaders for Serial GC, Parallel GC, and G1 GC.

      Motivation

      Traditional garbage collectors (GCs) are famous for causing “tail latency” problems in Java workloads. By pausing application threads to collect garbage, some requests take significantly longer than they usually do. Applications may have a service level agreement (SLA) requiring tail latencies to be bounded for particular percentiles. For example, an SLA could say that P99 response times (the 99th percentile) must be below 10 ms, meaning that the shortest response time among the 1% worst response times should not exceed 10 ms. ZGC is a low-latency GC that has been available since JDK 15 (JEP 377). It greatly improves GC-induced tail latency by performing GC work concurrently.

      However, GC is not the only JVM mechanism that causes tail latency. Java workloads are often "scaled out" by starting new instances to handle more incoming requests. Requests sent to a new instance take significantly longer than requests sent to a warmed-up JVM, which also causes tail latency. JEP 483: Ahead-of-Time Class Loading & Linking reduces startup/warmup-induced tail latency by capturing much of the corresponding work in an AOT cache.

      The AOT cache contains data about the state of an application from a training run. Some of this data consists of the Class objects for all the loaded classes in the program. For example, a training run of the Spring Petclinic 3.2.0 program creates a 130 MB AOT cache which contains Class objects for ~21,000 loaded and linked classes. These objects are stored in the AOT cache and are loaded into the Java heap when the application is run again, in order to make class loading appear instantaneous. Unfortunately, the object archiving mechanism used by the AOT cache is incompatible with ZGC, which forces latency-conscious users to choose whether their application should suffer from GC-induced tail latency or from startup/warmup-induced tail latency.

      To reduce tail latency, it is important to take a systems approach, where all components are designed to work together. This JEP introduces a GC-agnostic object archiving mechanism for the AOT cache, allowing it to be used with ZGC as well as any other GC. This way, users who wish to reduce startup/warmup-induced tail latency by using the AOT cache are no longer forced to select a GC other than ZGC, which would add tail latency elsewhere.

      Description

      An AOT cache helps a program start substantially faster; for the Spring Petclinic example above, the improvement reported by JEP 483 is 42%. The startup/warmup optimizations that rely on object archiving risk being undermined by the cost of loading archived objects. Efficient archived object loading is therefore important for the AOT cache.

      Offline Layout Challenges

      The current object archiving system of the AOT cache directly maps memory from an archive file straight into the Java heap, which is efficient. However, in order for this approach to work well, the layout in the file has to exactly match, bit by bit, what the GC (and the rest of the JVM) expects to see at runtime. There are three different layers of layout policies that might cause bits not to match. These layout concerns are:

      1. Heap layout. The heap layout is a high level strategy for where in the heap a GC chooses to place objects of a particular size and class.
      2. Field layout. The field layout is concerned with where to store contents of fields within an object. It is not GC dependent.
      3. Object reference layout. This is the bit-encoding strategy for reference fields. It varies based on the different optimization goals of different GCs.

      These three layers of object layout policies can vary significantly between GC implementations and heap sizes. For each level of layout policy, there are various factors that can affect the bit pattern of how objects are represented in memory. For example:

      • There are currently six different pointer formats in HotSpot
      • There are various heap layouts - contiguous, region-based, discontiguous
      • Object alignment differs depending on object size for different GCs
      • Object location and grouping differs depending on object size for different GCs

      These low-level bit anomalies make it challenging to load the archived objects in a GC-agnostic fashion. That is why ZGC is not supported by the AOT cache today.
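      To make the reference-layout layer concrete, a compressed-pointer encoding is one example of such a bit-encoding strategy: a 64-bit heap address is stored as a 32-bit offset from a heap base, scaled by the object alignment. A minimal sketch, assuming an illustrative base and shift (not HotSpot's actual configuration):

```java
// Sketch of one GC-dependent reference layout: a compressed-pointer style
// encoding that stores a 64-bit heap address as a 32-bit value, relative
// to a heap base and scaled by the object alignment. The base and shift
// here are illustrative assumptions, not HotSpot's actual configuration.
public class CompressedRefSketch {
    static final long HEAP_BASE = 0x0000_1000_0000_0000L; // hypothetical heap base
    static final int SHIFT = 3;                           // 8-byte object alignment

    // Encode a heap address into its 32-bit compressed form.
    static int encode(long address) {
        return (int) ((address - HEAP_BASE) >>> SHIFT);
    }

    // Decode a 32-bit compressed reference back into a heap address.
    static long decode(int compressed) {
        return HEAP_BASE + (Integer.toUnsignedLong(compressed) << SHIFT);
    }
}
```

A GC running without compressed references stores the full 64-bit address instead, so the same object graph has a different bit pattern in memory; mismatches of this kind are what defeat direct mapping of an archive.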

      Object Streaming

      This JEP introduces a new object archiving mechanism that abstracts away the two layout concerns that are GC dependent: heap layout and object reference layout. The fundamental insight is that object references need not be stored in the archive as physical addresses; they can instead be stored as logical values. When the archive is written, one object's reference to another is written not as an actual address but as a logical value. When the archive is read, these logical values are turned back into actual pointers according to the GC in effect.

      The new mechanism archives high-level object descriptors that may be used at runtime to materialize an object. Based on these descriptors, it allocates objects, initializes their payload, and links objects together one by one. Loading objects in this way is referred to as "object streaming" in this document. Archived object streaming allows whichever GC is selected at runtime to materialize the archived objects, because the object layout policy is applied online.

      Archived object descriptors are laid out contiguously in the archived heap of the AOT cache. Each object descriptor has an “object index” based on the order in which it was laid out in the archived heap. This object index describes the identity of an archived object; for example, object references between objects are encoded as object indices. One table maps object indices to the corresponding materialized heap objects, and another table maps object indices to object descriptors in the archived heap. Both tables are embodied as arrays, making lookups fast.
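      The role of object indices can be sketched as follows. The descriptor shape and all names here are illustrative assumptions, not the actual AOT cache format:

```java
// Sketch of index-based object archiving: references between archived
// objects are plain int indices, and flat arrays map an index to its
// descriptor and to the heap object materialized from it. All names and
// shapes are illustrative, not the real AOT cache format.
public class ObjectIndexSketch {
    // A descriptor records the class of an archived object and the
    // object indices of the objects it references.
    record Descriptor(String className, int[] referenceIndices) {}

    final Descriptor[] descriptors; // object index -> archived descriptor
    final Object[] materialized;    // object index -> materialized heap object

    ObjectIndexSketch(Descriptor[] descriptors) {
        this.descriptors = descriptors;
        this.materialized = new Object[descriptors.length];
    }

    // Materialize the object at the given index: allocate a stand-in,
    // publish it, then link it by translating reference indices into
    // real references. Publishing before linking also handles cycles.
    Object materialize(int index) {
        if (materialized[index] != null) {
            return materialized[index];
        }
        Descriptor d = descriptors[index];
        Object[] links = new Object[d.referenceIndices().length];
        materialized[index] = links;
        for (int i = 0; i < links.length; i++) {
            links[i] = materialize(d.referenceIndices()[i]);
        }
        return links;
    }
}
```

Because references are logical indices, the sketch never needs to know the pointer format of the GC in effect; linking happens with ordinary heap references chosen at runtime.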

      Offloading

      The AOT cache has roots into the archived heap, allowing it to refer to Java objects. Loading a root object from the archived heap requires streaming all objects transitively reachable from that root. This requires a graph traversal, which takes some time to complete. The latency of this traversal is hidden by offloading most of the object streaming work to a background thread that starts streaming archived objects early in JVM bootstrap. Root loading is performed lazily while the background thread concurrently streams objects from the archive.

      The bulk of the object streaming work is offloaded to the background thread. It traverses archived object descriptors from the roots, streaming all transitively reachable object descriptors. To accelerate this traversal, object descriptors are laid out in the order expected by the traversal algorithm (depth-first search), so that a linear iteration over the archived object descriptors yields the same visiting order as a graph traversal, making the traversal faster. This ordering also allows defining, for each root being transitively traversed, a linear range of object descriptors currently being materialized by the background thread. Archived objects can then be partitioned into three distinct partitions:

      1. Objects already processed by the background thread
      2. Objects currently being processed by the background thread
      3. Objects not yet processed by the background thread

      This partitioning of archived objects allows the background thread to perform the bulk of its work without interfering with lazily triggered object streaming from the application. When an application thread requests an archived object in the already processed range, the object can simply be looked up; it is guaranteed to have been transitively streamed already. When requesting an object in the currently processed range, the application thread waits for streaming to finish and then looks up the object. Only when requesting an object in the not yet processed range does an application thread have to perform an explicit graph traversal to ensure all transitively reachable objects are materialized. That traversal can run without expensive synchronization with the background thread, because the boundaries of ongoing background materialization are clearly defined.
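      The decision an application thread makes when requesting an archived object can be sketched as follows; the watermark fields and names are illustrative assumptions, not the actual implementation:

```java
// Sketch of the three-partition decision for a requested object index.
// Two watermarks partition the DFS-ordered descriptors into a fully
// processed range, an in-progress range, and an unprocessed range.
// Field and method names are illustrative assumptions.
public class PartitionSketch {
    enum Action { LOOKUP, WAIT_THEN_LOOKUP, TRAVERSE }

    final int doneEnd;       // descriptors [0, doneEnd) are fully streamed
    final int inProgressEnd; // descriptors [doneEnd, inProgressEnd) are being streamed

    PartitionSketch(int doneEnd, int inProgressEnd) {
        this.doneEnd = doneEnd;
        this.inProgressEnd = inProgressEnd;
    }

    Action classify(int objectIndex) {
        if (objectIndex < doneEnd) {
            return Action.LOOKUP;           // guaranteed materialized: table lookup only
        }
        if (objectIndex < inProgressEnd) {
            return Action.WAIT_THEN_LOOKUP; // background thread will finish it shortly
        }
        return Action.TRAVERSE;             // app thread traverses the graph itself
    }
}
```

Because the DFS layout makes each range a contiguous interval of indices, the classification is two integer comparisons rather than a synchronized graph query.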

      Mapping versus Streaming

      Some applications run in an environment where mapping the archive into memory performs better than streaming objects from the archive, and vice versa; the trade-off between the two formats, and how one is chosen, is discussed under Risks and Assumptions.

      Alternatives

      Building ZGC support for the AOT cache does not require a GC-agnostic solution. One possible approach would be to double down on GC-specific logic and build a ZGC-specific object archiving mechanism that lays out objects with the heap layout and pointer layout that ZGC expects. This has some notable disadvantages:

      • There is a default AOT cache shipped with the JDK. A ZGC-specific solution would require an extra ZGC-specific AOT cache to be shipped with the JDK, inflating the size of the JDK.
      • Development of ZGC would be slowed down by interweaving object archiving with the core GC algorithm of ZGC, which is already rather complicated.

      The main advantage of a ZGC-specific solution would presumably be faster warm JVM starts with ZGC. However, current experiments suggest that the streaming object loader is efficient enough without resorting to that.

      As for GC-agnostic object archiving, different approaches have been considered that involve bulk materializing all objects eagerly. Without offloading and laziness, however, they impacted startup times negatively.

      Testing

      A large number of object archiving tests have already been written. They will be adapted to regularly test ZGC and the new object streaming approach.

      Risks and Assumptions

      • Since the bulk of object streaming is performed by a background thread, its efficiency relies on an extra core being available during startup. Some constrained cloud environments might not have an extra core. This risks delaying startup due to object streaming. However, such small deployments will likely have heap sizes smaller than 32 GB and will therefore not use a concurrent GC; the object loader of this JEP is not used for such deployments by default.

      • We assume that an application's AOT cache will be used in an environment similar to that where the cache was created. In particular, whether an application experiences warm start or cold start is critical to the effectiveness of the cache:

        A warm start is when the JVM starts close in time to a previous start, such as when running a Java program over and over again. Because the AOT cache stays in the file system cache between runs, direct mapping of the AOT cache into the Java heap is almost free, even if the AOT cache grows large. Streaming is still fast due to offloading its work, but direct mapping nevertheless has an edge for warm starts.

        Conversely, a cold JVM start is the first start in a while, such as when deploying JVMs in the cloud. The AOT cache is unlikely to be in the file system cache, and the larger the AOT cache, the larger the cost of loading it from disk becomes. Streaming, however, can hide the latency of materializing objects from the archive regardless of its size, giving it the edge for cold starts.

        You can force creation of an archive in the streaming format with the command line option -XX:+DumpStreamableObjects. Use of this option is generally unnecessary because the archive format (streaming versus mapped) is selected heuristically in the training run: if the application's heap is larger than 32 GB or if ZGC is in use, the archive is written in the streaming format; otherwise the archive is written directly from the JVM's memory. We believe that the use of ZGC or a large heap size is a useful proxy for indicating that an application is deployed with multiple cores available, which in turn might indicate cloud deployment and therefore cold start.
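        The format-selection heuristic described above amounts to a simple predicate. This is a sketch of the stated policy with illustrative names, not actual HotSpot code:

```java
// Sketch of the archive-format heuristic: the streaming format is chosen
// when it is forced on the command line, when ZGC is in use, or when the
// heap exceeds 32 GB; otherwise the directly mappable format is used.
// All identifiers here are illustrative assumptions.
public class FormatHeuristicSketch {
    static final long THRESHOLD_BYTES = 32L * 1024 * 1024 * 1024; // 32 GB

    static boolean useStreamingFormat(boolean dumpStreamableObjects,
                                      boolean usingZGC,
                                      long maxHeapBytes) {
        return dumpStreamableObjects || usingZGC || maxHeapBytes > THRESHOLD_BYTES;
    }
}
```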

            Reviewed by: Ioi Lam, Stefan Karlsson, Vladimir Kozlov