Summary
Provide an Ahead-of-Time (AOT) object archiving mechanism that is agnostic to which Garbage Collector (GC) is selected at deployment time. Currently, the Z Garbage Collector (ZGC) does not support the object archiving mechanism of the AOT cache.
Goals
- Allow all garbage collectors to work smoothly with the AOT cache from Project Leyden.
- Keep GC implementation details and policies separate from the object archiving mechanism.
Motivation
Garbage collectors such as G1 pause application threads to collect garbage. This causes tail latency, where the application's handling of some requests takes significantly longer than usual. Application developers seek to manage tail latency through Service Level Agreements that require, e.g., the P99 response time (the 99th percentile) to be below 10 ms; this means that at most 1% of requests may take 10 ms or longer. To support latency-sensitive applications, the Z Garbage Collector (ZGC) became a production feature in JDK 15, improving GC-induced tail latency by collecting garbage concurrently.
However, GC is not the only JVM mechanism that causes tail latency. Java applications are often "scaled out" by starting new instances to handle more requests, but requests sent to the new instance take significantly longer than requests sent to a warmed-up JVM. To address this source of tail latency, Project Leyden introduced Ahead-of-Time Class Loading & Linking. This improves application startup and warmup by caching the classes of an application so that they appear to be loaded instantaneously in production. For example, a training run of the Spring PetClinic application creates a cache of Class objects for the ~21,000 classes loaded to handle requests; in subsequent runs, these Class objects are mapped from the cache to the Java heap at startup, in effect loading the classes without scanning the class path 21,000 times.
Unfortunately, the way that Class objects are cached is incompatible with ZGC. This forces users to choose between suffering GC-induced tail latency or startup/warmup-induced tail latency: If they use ZGC to reduce the former, they cannot use Leyden to reduce the latter, and vice versa. To help users avoid this painful choice, we rely on the maxim that the best way to reduce tail latency is with a systems approach, where the design of discrete components is coordinated. Accordingly, we propose to make the object cache work with all GCs, including ZGC. Users will be able to improve tail latency during startup and warmup (by using the cache) without having to select a GC other than ZGC which adds tail latency elsewhere.
Description
The object archive of the AOT cache should ideally have the following three properties:
- Instant load: The solution should appear to load objects instantly. Otherwise, slow object loading could undermine the startup and warmup improvements that the object archive is meant to provide.
- GC-agnostic: The solution should work well regardless of GC selection. Otherwise, the latency improvements of using an AOT cache in cloud services may be undermined by being constrained to a GC with worse latencies. Moreover, this allows the JDK itself to ship a baseline AOT cache that works for all GCs.
- Object shape tolerance: The solution should scale well when future AOT optimizations add more diverse object shapes into the archived heap.
The current object archiving solution of the AOT cache checks the first box, but not the last two. It does not support all GCs. As for the GCs that are supported, there are size restrictions for what objects may be archived, limiting what kind of data structures may successfully be archived. It would be desirable to have a solution that checks all the boxes.
AOT Layout Challenges
The current object archiving system of the AOT cache maps memory from the AOT cache file directly into the Java heap, which is efficient. However, for this approach to work, the layout in the file has to exactly match what the GC expects to see at runtime. Importantly, this means that references from one object to another must match bit for bit. The primary reason why the current object archiving mechanism is neither GC-agnostic nor object-size-agnostic is the conflicting layout constraints of different GC algorithms. Only the object reference formats that all the different GCs have in common can be serialized.
The most notable source of reference format incompatibility is that ZGC has extra metadata bits encoded into its reference fields, used to manage concurrency. This reference metadata is only used by ZGC, hence the reference format of ZGC is fundamentally incompatible with other GCs. This is why there is currently no ZGC support for object archiving in the AOT cache.
The reference format of the other GCs is also susceptible to GC-specific heap size policies. For larger heaps (above 32 GB), reference fields are stored as 64-bit addresses. The addresses themselves depend on where in the virtual address space the heap is placed, which may vary from run to run. For smaller heaps (below 32 GB), the other GCs use a pointer compression scheme to use 32-bit reference fields. There are three different compression schemes, selected heuristically at runtime, depending on the heap size and heap location in the virtual address space. These different reference formats are incompatible with each other.
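To illustrate why compressed reference formats conflict, here is a minimal sketch that assumes each compression scheme differs only in a heap base address and a shift amount. The class name, method names, and the concrete base value are hypothetical and are not taken from the JVM sources.

```java
// Illustrative only: names and constants are hypothetical, not HotSpot code.
final class NarrowRef {
    /** Decodes a 32-bit reference field into a 64-bit address. */
    static long decode(int narrow, long heapBase, int shift) {
        // Treat the 32 stored bits as unsigned, scale them, then add the base.
        return heapBase + ((narrow & 0xFFFF_FFFFL) << shift);
    }

    /** Encodes a 64-bit address into a 32-bit reference field. */
    static int encode(long address, long heapBase, int shift) {
        return (int) ((address - heapBase) >>> shift);
    }
}
```

Under these assumptions, the same 32 stored bits decode to three different addresses depending on whether the scheme uses no base and no shift, no base and a shift of 3, or a non-zero base and a shift of 3. A file laid out for one scheme therefore cannot be mapped unchanged into a heap that uses another. With a shift of 3 (8-byte alignment), 32 bits can address 2^32 * 8 bytes, which is where the 32 GB boundary comes from.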
The reference format is also susceptible to GC-specific object size class policies. For example, when a smaller object refers to a larger object, some GCs require the larger object to be stored in a separate location, away from the smaller objects. The number of size classes and their boundaries depends on GC-specific policies. The consequence of these policies is another dimension to the reference formats.
All these different GC-specific reference format constraints make it challenging to load archived objects in a fashion that is efficient, GC-agnostic and object-size-agnostic. This is why ZGC is not supported today, and large objects may not be archived.
Object Streaming
This JEP introduces a new object archiving mechanism that abstracts away the object reference layout. The fundamental insight is that checking the last two check boxes requires object references to use a more abstract logical number representation, instead of raw addresses. But in order to use the more abstract object reference format in the archive, a different solution than directly mapping memory is required to provide the illusion of instant object loading from the archive.
The mechanism chosen for this JEP is to perform object streaming from the archive. The objects are streamed in the background, while the application is running. By performing the bulk of the object loading work in a separate thread, object loading from the archive still appears to happen instantaneously to the application. Unlike direct mapping, however, the streaming solution offers greater freedom in the format used to represent archived objects. Field references may now be encoded with logical numbers, which allows checking the last two check boxes.
When the AOT cache is created, a sequence of high-level object descriptors is embedded into the archived heap section of the AOT cache. These archived object descriptors describe the state that the payload of each archived object should have. At runtime, a JVM using an AOT cache uses these archived object descriptors to materialize Java heap objects one by one. Materializing an archived object involves allocating Java heap memory, initializing it to the described initial payload and linking object references to other materialized objects, as described by an object descriptor. By materializing objects one by one, the GC selected at runtime is free to use the most appropriate reference format. Meanwhile, offloading the materialization still provides the illusion of instant object loading.
Archived object descriptors are laid out contiguously in the archived heap of the AOT cache. The object descriptors describe the payload of archived objects. For simplicity, the memory layout of the archived object descriptors currently mirrors the memory layout of objects in the Java heap, except that object references are encoded as logical object indices instead of raw pointers. Consider for example a String object with the following fields:
public class String {
    private final byte[] value;
    private final byte coder;
    private int hash;
    private boolean hashIsZero;
    // ...
}
Such a String object might have the following memory layout:
offset  payload
        +-----------------------------------------------+
0x0     | object header                                 |
        +-----------------------+-----------------------+
0x8     | class                 | hash                  |
        +-----------------------+-----------------------+
0x10    | value                                         | <-- object reference
        +-----+-----+-----------------------------------+
0x18    |coder|hashIsZero                               |
        +-----+-----+-----------------------------------+
Given this memory layout, the main difference between an archived object descriptor and a runtime heap object is that the value reference field contains a logical object index such as 5 to encode a reference to the fifth object in the heap archive, instead of a raw pointer to the fifth object as preferred by a particular GC configuration. From such an object index, it is possible to find both the offset of the object descriptor in the heap archive and the object address in the Java heap, by performing a lookup in a corresponding side table. These tables are simple arrays.
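The role of logical indices and the side table can be sketched as follows. This is a simplified model, not the archive format: Descriptor, Node, and materializeAll are hypothetical names, index 0 is assumed to encode null, and objects are allocated in one pass and linked in a second pass rather than one by one.

```java
import java.util.List;

// Hypothetical model of archived object descriptors; not the HotSpot format.
final class ArchiveSketch {
    /** A descriptor: a payload plus reference fields stored as logical indices. */
    record Descriptor(String payload, int[] refIndices) {}

    /** A materialized object holding real Java references. */
    static final class Node {
        final String payload;
        final Node[] refs;

        Node(String payload, int refCount) {
            this.payload = payload;
            this.refs = new Node[refCount];
        }
    }

    /** Materializes all descriptors; assumes logical index 0 means null. */
    static Node[] materializeAll(List<Descriptor> descriptors) {
        // Side table: logical index -> materialized object (slot 0 stays null).
        Node[] table = new Node[descriptors.size() + 1];
        // Pass 1: allocate each object and initialize its payload.
        for (int i = 0; i < descriptors.size(); i++) {
            Descriptor d = descriptors.get(i);
            table[i + 1] = new Node(d.payload(), d.refIndices().length);
        }
        // Pass 2: turn logical indices into real references via the side table.
        for (int i = 0; i < descriptors.size(); i++) {
            int[] refs = descriptors.get(i).refIndices();
            for (int f = 0; f < refs.length; f++) {
                table[i + 1].refs[f] = table[refs[f]];
            }
        }
        return table;
    }
}
```

Because the descriptors never contain raw addresses, the code that allocates each Node is free to place it anywhere and to use any reference encoding, which is the property the streaming design relies on.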
Note that the memory layout of archived object descriptors is no longer entangled with the memory layout of runtime objects in the same way that it was for direct memory mapping based object archiving. Hence, the format is subject to change. A more compact format could be used in the future in order to shrink the static footprint of the AOT cache.
Offloading
A background thread starts early in the JVM bootstrapping. When the AOT cache file is opened, it eagerly starts materializing the embedded sequence of archived objects until all archived objects have been materialized. By doing this work eagerly, the bulk of the archived object materialization work is offloaded to this background thread.
Successfully offloading the object materialization work to a background thread requires archived object loading from the application to be performed in a lazy and incremental fashion. Therefore, there need to be synchronization points before objects are accessed, where lazy materialization work can be flushed out. The first time a Class is used, a root into the object archive is accessed. Loading of such root objects lends itself as a natural synchronization point to flush out lazy materialization of the referenced object and its transitively reachable objects.
Consider the following example class:
class Example {
    public static String getMessage() { return "hello"; }
}
Without an AOT cache, when a user calls Example.getMessage() for the first time, the Example Class object has to be created. It has a constant pool. The first time getMessage is invoked, the "hello" literal is resolved from the constant pool, causing a String object with the "hello" payload to be created on-demand. The resolved String object is stored into the constant pool so that the next time Example.getMessage() is invoked, the String object does not need to be created again.
With an AOT cache, the first time Example.getMessage() is invoked, there is less work to do for the application thread. The AOT cache has a root referring to the Example Class object in the archived heap, including its pre-resolved constant pool containing the pre-resolved "hello" String object. With successful offloading, these objects are already loaded from the AOT cache by the time Example.getMessage() is invoked for the first time. Hence there is no need to create the Example class object, and its constant pool already has the pre-resolved "hello" String object. In a program with ~21,000 classes with large constant pools containing lots of objects, being able to skip all this work is a considerable startup time improvement.
The background thread traverses archived objects from the archived heap roots, materializing all transitively reachable objects. In order to accelerate the traversal from the background thread, archived objects are laid out in an order matching the order expected by the traversal algorithm (depth-first search). This way, a linear iteration over the archived object descriptors yields the same visiting order as a graph traversal algorithm. This makes the traversal itself faster. More importantly, this ordering also allows partitioning the set of objects according to where in the streaming process they are:
- Objects already processed by the background thread
- Objects currently being processed by the background thread
- Objects not yet processed by the background thread
This partitioning of archived objects allows the background thread to perform the bulk of its work, without interfering with lazily triggered object streaming from the application. This is the key for providing the illusion of instant object loading. When an application thread requests an archived object in the already processed range of objects, the object can simply be looked up from a table; it is guaranteed to have been transitively streamed already. When requesting an object from the currently processed range, the application thread waits for streaming to finish, and then looks up the object. Only when requesting an object from the not yet processed range does an application thread have to perform an explicit graph traversal to ensure all transitively reachable objects are materialized. That traversal can run without expensive synchronization with the background thread, due to the boundaries where background materialization is ongoing being clearly defined.
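The ordering and partitioning described above can be sketched as follows. This is a single-threaded illustration under stated assumptions: objects are identified by small integers, layoutOrder computes a depth-first pre-order from the roots, and classify compares an archive position against two boundaries. All names are hypothetical, and the real mechanism additionally needs synchronization between threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of archive ordering and partitioning; names are illustrative.
final class StreamPartition {
    enum Range { PROCESSED, IN_PROGRESS, NOT_YET_PROCESSED }

    /** Returns object ids in depth-first pre-order, the assumed archive order. */
    static List<Integer> layoutOrder(int[][] edges, int[] roots) {
        List<Integer> order = new ArrayList<>();
        boolean[] seen = new boolean[edges.length];
        Deque<Integer> stack = new ArrayDeque<>();
        for (int root : roots) {
            stack.push(root);
            while (!stack.isEmpty()) {
                int obj = stack.pop();
                if (seen[obj]) continue;
                seen[obj] = true;
                order.add(obj);
                // Push fields in reverse so the first field is visited first.
                for (int f = edges[obj].length - 1; f >= 0; f--) {
                    stack.push(edges[obj][f]);
                }
            }
        }
        return order;
    }

    /**
     * Classifies an archive position, assuming the background thread has
     * finished positions below processedEnd and is currently working on
     * positions in [processedEnd, inProgressEnd).
     */
    static Range classify(int position, int processedEnd, int inProgressEnd) {
        if (position < processedEnd) return Range.PROCESSED;     // plain table lookup
        if (position < inProgressEnd) return Range.IN_PROGRESS;  // wait, then look up
        return Range.NOT_YET_PROCESSED;                          // explicit traversal
    }
}
```

Since the archive order equals the traversal order, the background thread can advance through positions linearly, and an application thread only needs the two boundaries to decide which of the three cases applies.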
Mapping versus Streaming
Some applications run in an environment where mapping the archive into memory performs better than streaming objects from the archive, and vice versa. The trade-off follows naturally from the different mechanisms the two approaches use to give the illusion of instant object loading.
A warm start is when the JVM starts shortly after a previous start, such as when running a Java program over and over again. Because the AOT cache stays in the file system cache between runs, direct mapping of the AOT cache into the Java heap effectively loads the objects instantly. Streaming achieves the same illusion by instead offloading its work to a background thread, which relies on the availability of an extra core during startup.
Conversely, a cold JVM start is the first start in a while, such as when deploying JVMs in the cloud. The AOT cache is unlikely to already be in the file system cache, and the larger the AOT cache, the larger the cost of loading it from disk becomes. Streaming, however, can still hide the latency of materializing objects from the archive, which works best if there is an extra core available.
In summary, the weakness of object streaming is warm starts in constrained environments that do not have an extra core. Therefore, the direct mapping mechanism is heuristically used for AOT caches created with -XX:+UseCompressedOops, which is never used by ZGC, and is used by other GCs for heap sizes up to 32 GB. Conversely, the object streaming mechanism is used when either ZGC is used or the heap size is larger than 32 GB. What such systems have in common is that they typically have more than a single core available. The JDK ships with one AOT cache for -XX:+UseCompressedOops and one for -XX:-UseCompressedOops, ensuring both mechanisms are present out of the box. However, a curious user can explicitly enable object streaming when creating an AOT cache by using the AOTStreamableObjects JVM option.
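The default heuristic stated above can be restated as a small decision function. This is only a paraphrase of the paragraph, not JVM source code: the names are hypothetical, the 32 GB threshold is taken from the text, and the explicit AOTStreamableObjects override is deliberately left out.

```java
// Hypothetical restatement of the selection heuristic; not JVM source code.
final class LoadModeSketch {
    enum Mode { MAPPING, STREAMING }

    /** Heap size limit for compressed references, as stated in the text. */
    static final long COMPRESSED_OOPS_LIMIT = 32L * 1024 * 1024 * 1024;

    /** Chooses the default loading mechanism from the GC and the heap size. */
    static Mode select(boolean usingZgc, long maxHeapBytes) {
        // ZGC never uses compressed references; other GCs use them for
        // heaps up to the limit.
        boolean compressedOops = !usingZgc && maxHeapBytes <= COMPRESSED_OOPS_LIMIT;
        return compressedOops ? Mode.MAPPING : Mode.STREAMING;
    }
}
```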
Alternatives
Building ZGC support for the AOT cache does not require a GC-agnostic solution. One possible solution would be to double down on more GC-specific logic, and have a ZGC specific heap archive with a ZGC specific object reference layout. The main advantage of a ZGC specific solution would presumably be a more optimized starting experience for ZGC. However, the object loading cost is effectively hidden as long as there is an extra core. Therefore, the main optimization opportunity would be when using ZGC in a system without an extra core. Given the concurrent nature of ZGC, that would be an unusual environment to optimize for.
Having a ZGC tailored solution would however have some notable disadvantages:
- There is a default AOT cache shipped with the JDK. With a ZGC specific solution, this would require an extra ZGC specific AOT cache to be shipped with the JDK, inflating the size of the JDK, and hence cold JVM startup times for everyone, whether ZGC is used or not.
- Development of ZGC will be slowed down by interweaving object archiving with the core GC algorithm of ZGC, which is already rather complicated.
As for GC-agnostic object archiving, different approaches have been considered that involve eagerly bulk-materializing all objects. Without offloading and laziness, the instant object loading check box could not be ticked.
It is not a goal to remove the existing GC-dependent object archiving mechanism. While removing the existing GC-dependent object archiving mechanism of the AOT cache would allow detangling implementation details of other GCs from object archiving, we will not consider that at this time as there is not enough data to make such a decision yet.
It is not a goal to unify AOT cache artifacts produced for -XX:+UseCompressedOops with -XX:-UseCompressedOops.
Testing
A large number of object archiving tests have already been written. They will be adapted to regularly test with ZGC and the new object streaming approach.