Implement support for multiple generations in the Z Garbage Collector (ZGC). Having multiple generations will expand the number of Java workloads where ZGC will be a good fit. Java programs with higher allocation rates and larger live object sets will in particular be more likely to get sustained latency benefits from using ZGC.
Java applications running with a version of ZGC that supports multiple generations should experience:
- less risk of Java thread allocation stalls
- lower Java heap memory overhead
- lower CPU usage
- significantly higher sustained object allocation rates
The above should be experienced without significant throughput reduction compared to the current version of ZGC. The following properties of ZGC should be preserved:
- pause times should not exceed 1 millisecond
- heap sizes from a few hundred megabytes up to many terabytes should be supported
- minimal manual configuration should be needed
As an example of the last point, there should be no need to manually configure:
- the size of one or more generations
- the number of threads used by the garbage collector
- for how long objects should reside in the young generation
- It is not a goal to keep a version of ZGC without support for generations
- It is not a goal to support reference processing in the young generation
More Java applications with strict heap size limits and/or high allocation rates can run with ZGC without hitting allocation stalls, and at the same time not have throughput performance degraded significantly.
ZGC is a garbage collector designed for low latency and scalability, which has been available for production use since JDK 15 (see JEP 377).
The majority of the garbage collection work is performed while the Java application threads are running and the garbage collector only pauses the application threads briefly. At this time ZGC's pause times are consistently measured in microseconds. This can be compared to G1, the default garbage collector, which has pause times ranging between milliseconds and seconds depending on workloads and configurations. These low pause times are accomplished for all Java heap sizes: Java workloads can use heap sizes from a few hundred megabytes all the way up to multi-terabytes and still reap the benefits of low pause times.
For many workloads, simply using ZGC is enough to solve all latency problems caused by the garbage collector. This works well as long as there are enough resources (memory and CPU) available on the machine to ensure that ZGC can reclaim memory before the Java threads (which are running at the same time) run out of memory. However, ZGC currently stores young objects (newly allocated objects) and old objects (objects not removed during the lastest GC) together, forcing it to collect all objects at the same time. Old objects tend to stick around, and collecting them typically requires more resources and yields less memory. Conversely, young objects tend to die young, and collecting only young objects requires less resources and yields more memory. This tendency is commonly referred to as the weak generational hypothesis.
Generational ZGC changes how ZGC stores objects in the Java heap. Young and old objects are stored in separate and distinct areas of the Java heap, so called generations. Generations give ZGC the ability to separately collect only the profitable young objects. This allows Generational ZGC to use considerably less resources (CPU and memory) to operate, enabling it to solve the GC latency problem for more Java workloads. Some Java workloads might even experience a throughput increase when using Generational ZGC (due to the much lower resource usage). As an example: when running an Apache Cassandra benchmark, Generational ZGC manages to run with a fourth of the heap size and still gets four times the throughput compared to non-generational ZGC (while still not having pause times longer than 1 millisecond).
Generational ZGC splits the heap into two logical generations: one for newly allocated objects, the young generation, and one for long-lived objects, the old generation. Each generation is collected independently of the other, and all garbage collection occurs concurrently while the Java application is running. Only a very small amount of ZGC's work requires the threads of the Java application to be paused, and then only for very short periods (less than a millisecond).
With concurrent garbage collection the GC's threads are reading and modifying the Java object graph at the same time as the threads of the Java application. To avoid corrupting the heap and give the Java application a consistent view of the object graph, all threads need to collaborate. ZGC handles this collaboration by using colored pointers, load barriers and store barriers.
When ZGC is used, a field's reference to an object is implemented within the JVM as a colored pointer. A colored pointer is a pointer to a Java object in the heap, plus extra metadata to encode the currently known state of the object (known to be alive, address is correct, etc.). ZGC always uses 64-bit object pointers and can therefore fit both metadata bits and object addresses for heap sizes up to many terabytes.
A load barrier is a piece of code injected by ZGC into the application at the places where the application loads a non-primitive field of an object. That is, ZGC intercedes every time the application reads a field of an object that refers to another object. The load barrier first checks the reference and potentially performs some action (based on the state of the reference) before the application is allowed to use the referenced object.
In addition to load barriers, Generational ZGC uses store barriers to aid its work. Store barriers are code injected by the GC at stores to Java object fields. Generational ZGC's store barriers work with colored pointers to efficiently keep track of references from objects in one generation to objects in another generation. Generational ZGC introduces new metadata bits into colored pointers so that the store barriers can determine if the field being written to has already been recorded as potentially containing an inter-generational pointer. Colored pointers make Generational ZGC’s store barriers more efficient and result in lower overhead compared to traditional generational store barriers.
The addition of store barriers to ZGC changes the way Java application threads help ZGC with marking of objects (figuring out the objects that are reachable and therefore not eligible to be reclaimed by ZGC). The marking support moves out of ZGC's load barrier and into the store barrier. Generational ZGC's store barrier uses the metadata bits in the colored pointers to efficiently determine if the object previously referred to by the field (prior to the store) needs to be marked. Moving the marking responsibility out of the load barriers makes it easier to optimize them, which is important because load barriers are often more frequently executed than store barriers.
With marking support moved into the store barriers, the load barriers are only responsible for updating object pointers after ZGC has relocated objects. The load barrier will update the pointers to refer to the addresses of the relocated objects. Even if an object was not relocated, the load barrier will update the colored pointer to indicate that it is now known to point to a correct object address. Subsequent load barriers will see this color and will not check again if the object has been relocated.
The young and old generations will use their own sets of marking and relocation metadata bits in the colored pointers. This allows the generations to be collected independently.
The following sections describe important design concepts that distinguish Generational ZGC from non-generational ZGC and other garbage collectors:
- Removal of multi-mapped memory
- Optimization of store and load barriers
- Double-buffered remembered sets
- Collections without pre-reserved heap memory
- Dense heap regions
- Large objects
- Full garbage collections
Removal of multi-mapped memory
Non-generational ZGC used multi-mapped memory as a way to lower the overhead of the GC barriers. See https://wiki.openjdk.java.net/display/zgc/Pointer+Metadata+using+Multi-Mapped+memory . Generational ZGC replaces this technique with explicit handling in the GC barriers.
For the users, the main advantage is that this makes it easier to measure the amount of heap memory used by Generational ZGC. With multi-mapped memory, the same heap memory is mapped into three separate multiple virtual address ranges, and the used heap memory numbers are reported as triple the amount of the actually used heap memory.
For the GC implementation, it allows an increase in the number of metadata bits, without having to lower the max heap size.
The metadata bits are now free to move out of the limited part of the pointer that corresponds to the user-process accessible virtual memory range, which is used by the Java heap. Not only does this allow the retention of the maximum heap size when adding more metadata bits, it even opens up the possibility to increase the maximum heap size beyond the 16 terabyte limit of the non-generational ZGC.
The Java heap still represents a field’s reference to an object with a colored pointer, but the references used by Java application code are stripped off the metadata bits and become colorless pointers. That is, when application code uses object references on the JVM stack and in JVM local variables to call methods and pass arguments, those references are implemented by colorless pointers on the hardware stack or in CPU registers. The transitions back and forth between colorless and colored pointers are handled by the load and store barriers.
Since colored pointers are never used when the Java code reads an object’s fields or calls its methods, a more exotic colored pointer layout can be implemented as long as the conversion between colored and colorless pointers can be done efficiently. The colored pointer layout used by Generational ZGC puts the metadata bits in the low-order bits of the pointer and the object address in the high-order bits. This minimizes the number of CPU instructions in the load barrier. With careful encoding of the object addresses and metadata bits, a single shift instruction (on x64) can be used to both check if the colored pointer is “good” (no additional action needed) and to remove the metadata bits.
Optimization of store and load barriers
With the introduction of store barriers, and new responsibilities in the load barriers, more GC code will be inter-mixed with the Java application code. GC barriers need to be highly optimized to minimize the throughput reduction they incur. Many of the design decisions of the Generational ZGC revolve around the chosen colored pointer scheme and accompanying GC barriers.
Some of the techniques used to optimize the GC barriers are:
- Fast-paths and slow-paths
- Minimizing load barriers
- Remembered set barriers
- SATB marking barriers
- Fused store barrier checks
- Store barrier buffers
- Barrier patching
ZGC splits the GC barriers into two parts. The fast-path contains the check to determine if some extra GC work needs to be performed before the referenced object can be used by the Java application. The slow-path contains the code that performs the extra GC work. All object accesses run the fast-path check. As the name indicates, it needs to be fast, so this code is added directly in the JIT (just-in-time) compiled Java code. The slow-path is only taken a fraction of the times. When a slow-path has been taken, the color of the accessed object pointer is changed so that subsequent accesses to the same object pointer does not trigger a slow-path again for a while. Because of this, it is less important that the slow-paths are highly optimized. To balance out performance and maintainability, they are implemented as C++ functions within the JVM.
In non-generational ZGC, this was how the load barriers were split. In Generational ZGC, the same scheme is applied to the store barriers and their associated GC work.Minimizing load barrier responsibilities
In non-generational ZGC the load barrier was responsible for:
- Updating stale pointers to objects relocated by the GC
- Helping the GC mark reachable objects as live - the Java application loaded the object, so it should be considered live
In Generational ZGC, it is necessary to keep track of two generations and handle the conversion between colored and colorless pointers. To minimize complexity, and allow optimization of the load barrier fast-path, responsibility for GC marking is shifted to the store barriers.
The load barriers are responsible for:
- Removing metadata bits from colored pointers
- Updating stale pointers to objects that the GC relocated
The store barriers are responsible for:
- Adding back metadata bits to create colored pointers
- Maintaining the remembered set - which tracks old-to-young generation object pointers
- Helping the GC mark reachable objects live
When Generational ZGC collects the young generation, it only visits the object graph of the objects in the young generation. However, objects in the old generation could have fields pointing to objects in the young generation (old-to-young generation pointers). These fields must be visited for a few different reasons:
GC marking roots - They could contain the only reference keeping parts of the young generation object graph reachable. The GC needs to treat these fields as roots of the object graph to ensure that all live objects are found and marked alive.
Stale pointers in the old generation - the young generation collection moves objects, but pointers to those objects are not eagerly updated (remapped). Instead the pointers are lazily updated when the Java application encounters them. At some point, the GC needs to update the stale old-to-young generation pointers that the Java application didn't encounter.
The set of old-to-young generation pointers is called the remembered set. The remembered set describes all addresses in the old generation that potentially contain pointers to objects in the young generation. The store barrier is responsible for adding entries to the remembered set. Whenever a store is made to an object field, it is considered to potentially contain an old-to-young generation pointer. The store barrier slow-path filters out stores to fields in the young generation, since only addresses in the old generation are of interest. The store barrier does not filter based on the value written to the field which could be a value referring to either the young or old generation. It will be up to the GC to check the current value of the object field when it is time to use the remembered set. This is done to support the act-once property of the barriers.
The act-once property of the remembered set maintenance part of the store barrier means that, between the start of two consecutive young generation marking phases, the store barrier slow-path will only be triggered once per stored-to object field. The first time a field is written to, the following steps occurs:
- the fast-path checks the to-be-overwritten value of the stored-to field
- the color shows that this field has not been written to since the previous young generation marking phase
- the slow-path is taken
- the address of the stored-to field is added to the remembered set
- the new pointer value is colored and stored in the field
The pointer is then colored in a way such that subsequent fast-path checks will see that this object field has already passed a slow-path.SATB marking barriers
When marking moved over from the load barrier to the store barrier, the marking algorithm was changed to snap-shot-at-the-beginning (SATB). In SATB, the GC takes a snapshot at the beginning of the marking phase, and all objects reachable when marking started are guaranteed to be found and marked live by the GC. To accomplish this the GC needs to be told when references between objects in the object graph are broken. The store barrier therefore reports the to-be-overwritten field value to the GC. The GC will in turn then mark that object, and visit and mark its reachable objects. Interestingly, it is only the first time an object field is overwritten, within a marking cycle, that the Java threads need to report the to-be-overwritten field value to the GC. Subsequent stores to the same object field will only replace values that the GC is guaranteed to find anyway (because of the SATB property).
This enables the use of the act-once property for the GC marking part of the store barrier, such that, between the start of two consecutive marking phases, the store barrier slow-path will only be triggered once per stored-to object field.Fused store barrier checks
There are many similarities between the remembered set and GC marking parts of the store barriers. Both use colored pointer fast-path checks and their respective act-once property. Instead of having separate fast-path checks for both conditions, they have been fused into one combined fast-patch check. If either of the two properties fail, the slow-path will be taken and the required GC work performed.Store barrier buffers
Splitting up the GC barriers into a fast-path, slow-path, and the pointer coloring, reduces the number of calls to the C++ slow-path functions. Generational ZGC lowers the overhead further by placing a JIT-compiled medium-path between the fast-path and the slow-path. In the medium-path, the to-be-overwritten value and the address of the object field are stored in a store barrier buffer and returned back to the Java application code without taking the expensive slow-path. The slow-path is only taken once the store barrier buffer has been filled up. This amortizes away some overhead of transitioning from Java code to the C++ slow-path code.Barrier patching
Both the load barriers and store barriers perform their checks against values in global, or thread-local, variables that the GC changes when it transitions to a new phase. There are different ways to read these global variables in the GC barriers and the overhead of this is different on different CPU architectures.
In Generational ZGC this is improved by using code patching when possible. The global values are encoded inside the CPU instructions as immediate values. This removes the need to make an indirection through a global variable, or thread-local variable, to fetch the current value. The immediate values are patched when methods are called the first time after the GC has changed phase (E.g. starts young generation marking). This further reduces the overhead of ZGC's barriers.
Double-buffered remembered sets
Many GCs use a remembered set technique called card table marking to keep track of inter-generational pointers. When the Java threads write to an object field, they also write (dirty) a byte into a large byte array (the card table). Typically, one byte in the table corresponds to an address range spanning 512 bytes in the Java heap. To find all the old-to-young generation object pointers, the GCs have to locate and visit all object fields within the address ranges that correspond to the dirty bytes in the card table.
In contrast, Generational ZGC records the object field locations precisely, by using bitmaps, where each bit precisely represent a potential object field address. Each old generation region has a pair of remembered set bitmaps. One of the bitmaps is active and populated by Java threads and their store barriers, while the other bitmap is used by the GC as a read-only copy of all recorded old generation object fields that potentially point to objects in the young generation. These two bitmaps are atomically swapped at every young generation collection start. One benefit of this is that the Java threads don’t have to wait for the bitmaps to be cleared. The GC processes and then clears one of the bitmaps, while the other bitmap is concurrently being populated by the Java threads. Another benefit is that, since this allows the Java threads and GC threads to work on distinct bitmaps, it removes the need for extra memory barriers between the two types of threads. Other generational collectors that use card table marking, like G1, require a memory fence when cards are marked, resulting in potentially worse store barrier performance.
Collections without pre-reserved heap memory
Young generation collections in other HotSpot GCs use a scavenging model where live objects are found and relocated in a single pass. All objects in the young generation need to be relocated before the GC has complete knowledge about what objects were alive. The GCs using this model can only reclaim and hand back the young generation memory to Java application threads after all objects have been relocated. These GCs need to guess the amount of memory needed for all the surviving objects, and ensure that said amount of memory is available when the GC is started. If the guess is wrong, a more expensive cleanup operation is needed. E.g. in-place pinning of non-relocated objects (leads to fragmentation), or a Full GC with all Java threads stopped.
Generational ZGC uses two passes: the first to visit and mark all reachable objects, and the second to relocate marked objects. Because the GC has complete liveness information before the relocation phase starts, all objects can be relocated linearly within the heap regions. As soon as all live objects have been relocated out of a region, it can be reused as a new target region for relocations (or allocations by Java threads). Even when there are no more free regions to relocate objects into, ZGC can still proceed by switching over to compacting objects into the currently relocated regions. This is possible due to colored pointers.
This guarantees that Generational ZGC will make progress to relocate and compact the young generation without having to pre-reserve heap memory. This enables ZGC to run Java applications using lower heap sizes than if scavenging were used. Though, to ensure that Java applications don’t stall waiting for heap memory, the young generation collection marking phase still needs to start and finish before the Java application runs out of heap memory.
Dense heap regions
When relocating objects out of the young generation, the amount of live objects and the memory they occupy will be different between the regions. For example, more recently allocated regions will likely contain more live objects.
ZGC looks at the density of the young generation regions in order to determine which regions are worth relocating to get back memory and which regions are either too full or too expensive to relocate. The regions that are not selected for relocation are aged in place. The objects stays at their locations and the regions are converted to either survivor regions (kept in the young generation) or promoted into regions belonging to the old generation. The objects in the surviving regions get a second chance to die, with the hope that by the time the next young generation collection starts, enough objects have died to make them eligible for relocation.
This method of in-place aging dense regions lowers the resources used to collect the young generation.
ZGC already handles large objects well. By decoupling virtual memory from physical memory and over-reserving virtual memory, ZGC can usually dodge the fragmentation problems that sometimes make it hard to allocate large objects with G1.
With Generational ZGC, this is taken a step further by allowing large objects to be allocated in the young generation. Given that regions can be aged without relocating them, there’s no need to allocate large objects in the old generation just to prevent expensive relocations. Instead, they can get collected in the young generation if they are short-lived, or be cheaply promoted to the old generation if they are long-lived.
Full garbage collections
When the old generation is collected, there will be pointers from objects in the young generation pointing to objects in the old generation. These pointers are considered roots to the object graph of the old generation. Objects in the young generation mutate often, so young-to-old generation pointers are not tracked. Instead these pointers are found by running a young generation collection together with the old generation marking phase. When this young generation collection finds pointers into the old generation, they are handed over to the old generation marking.
This extra young generation collection will still execute as a normal young generation collection and leave live objects in the surviving regions. One effect of this is that surviving objects in the young generation will not be part of the weak reference processing, or class unloading, done by the old generation collection. This could be observed by the Java application by, for example, releasing the last reference to an object graph, call System.gc(), and then expect some weak references to be cleared/enqueued and/or classes to be unloaded. To mitigate this, an extra young generation collection is started when the GC is explicitly requested by the user. This young generation collection is done before the old generation collection starts and it promotes all surviving objects into the old generation.
Simpler barrier and pointer coloring schemes
The currently chosen load and store barrier implementation is non-trivial to understand. A simpler version could be easier to maintain at the cost of more expensive load and store barriers. Around ten different barrier implementations have been tried and all have been shown to be less performant than the chosen shift-based load barrier. Continuous investigation and analysis of the performance vs complexity trade-off might still be worth considering.
Keep using multi-mapped memory
The proposed colorless roots scheme could be skipped by using a simpler solution that utilizes multi-mapped memory. If more metadata bits in the pointers are needed than in non-generational ZGC, then the maximum heap size will be restricted. Another approach could be to use a hybrid solution, with some bits using multi-mapped memory and other bits being removed and added by the load and store barriers.
To keep track of colorless and colored pointers, the type system of ZGC is changed so that no implicit conversion can be made between the two types. The colored pointers are kept restricted to the GC code and barriers. As long as the runtime uses HotSpot's access API and barriers to access object pointers, it will only ever see the dereferenceable colorless pointers. The runtime-visible object pointer type is expected to always contain colorless pointers. Extensive verification code is injected into the different object pointer types to quickly find when pointers are broken or barriers are missing.
The standard set of tests for garbage collection algorithms will be used to ensure correctness.
Risks and Assumptions
Higher barrier overheads. The extra barriers could introduce a throughput degradation. The hope is that most of this will be offset by the gains of not having to frequently collect objects in the old generation.
Complexity of the new barriers.
Complexity of having two different collectors running concurrently.
Reference processing might be needed for the young generation.