JDK-8326035: CDS Object Streaming

Details

    • JEP
    • Resolution: Unresolved
    • P3
    • None
    • hotspot
    • Erik Österlund
    • Feature
    • Open
    • gc
    • Implementation
    • hotspot dash dev at openjdk dot org
    • M
    • M

    Description

      Summary

      An object archiving mechanism for Class-Data Sharing (CDS) that is independent of which Garbage Collector (GC) is selected at deployment time.

      Goals

      Currently, the Z Garbage Collector (ZGC) does not support CDS object archiving. This JEP aims to address that. The primary goals of this JEP are:

      • Support CDS object archiving for ZGC (and indeed any other GC)
      • A unified CDS object archiving format and loader

      Secondary goals:

      • Keep GC implementation details and policies separate from the CDS archived object streaming mechanism

      Non-Goals

      It is not a goal at this time to:

      • Remove the existing GC-dependent CDS object archiving mechanism
      • Unify CDS artifacts produced for -XX:+UseCompressedOops with -XX:-UseCompressedOops

      While removing the existing GC-dependent object archiving mechanism would allow disentangling the implementation details of other GCs from CDS object archiving, we will not consider that at this time, as the Leyden project is in its early stages and it is not yet clear what the effect would be.

      Success Metrics

      It should not take significantly longer for the JVM to boot with the new GC-agnostic object loader than with the existing GC-specific object loaders for Serial GC, Parallel GC, and G1 GC. As for ZGC, which did not previously have an archived object loader, it should at least not start slower or perform worse when using archived object streaming.

      Motivation

      Users of ZGC enjoy low GC latencies, but GC is not the only source of latency jitter. In fact, when using ZGC, the biggest source of latency jitter is the early phases of the application. Dealing with warmup issues, and to a lesser extent startup issues, therefore matters to ZGC users, which is why it is important that ZGC gains full support for CDS, including object archiving, going forward.

      The existing object archiving system used in CDS maps memory from an archive file directly into the Java heap. For this approach to work well, the layout in the file has to match, bit by bit, what the GC (and the rest of the JVM) expects to see at runtime. There are three granularities of layout policy that can cause bits not to match, each posing challenges for the current object archiving approach:

      1. Heap layout. The heap layout is a high-level strategy for where in the heap a GC chooses to place objects of a particular size and class.
      2. Field layout. The field layout is concerned with where to store the contents of fields within an object, typically as an offset relative to the object's start address.
      3. Object reference layout. This is the bit encoding strategy for reference fields.

      These three levels of object layout policy can vary significantly between GC implementations and heap sizes, which makes it challenging to share the same archived object format. Having different archived object formats for different GCs might be acceptable when creating a CDS archive for a particular deployment. However, arguably the most widely adopted way of using CDS is through the default CDS archive shipped with the JDK, which makes the JVM start faster by default. In that scenario, it is difficult to predict at JDK build time which GC a user is going to select. We could resort to shipping duplicate object archives for different combinations of GC and compressed-pointer settings, or we could define an object archive format that is completely independent of GC implementation details. This JEP proposes a GC-invariant object archiving mechanism.

      Heap Layout Impact

      Today, CDS object archiving is supported for -XX:+UseSerialGC, -XX:+UseParallelGC, -XX:+UseG1GC and -XX:+UseEpsilonGC, except notably on Windows, which does not support mapping a file into already mapped heap memory. All of these GCs have a contiguous memory layout, meaning that the heap is committed from a particular start address to a particular end address. The GC with the most complex memory layout among these is G1. With G1, the heap is split into multiple "regions". Objects may not cross from one region to another when laid out in memory; each object must be fully contained inside a region. Objects that are larger than half the region size get a special type of region called "humongous", which contains the entire object and occupies multiple contiguous heap regions.

      Because G1 is currently the most constraining GC in terms of heap layout, the archived object format has been built around G1. Objects that are large enough to become "humongous" G1 objects cannot currently be dumped at all, and the JVM carries workarounds for this object size restriction. Such workarounds are acceptable as long as users never have to deal with them, but if object archiving is going to become observable by users, having object size restrictions based on the implementation details of particular GCs seems undesirable.

      Padding objects are inserted at what could be G1 region boundaries, to ensure that objects in the CDS archive never cross a G1 region boundary. This format works for G1, and also for Serial GC and Parallel GC.

      ZGC, in contrast, does not have a contiguous heap layout like the other GCs; its heap layout is discontiguous and region based. This means that ZGC regions can occupy a vast amount of virtual address space. A discontiguous memory layout has the advantage that external fragmentation due to large objects does not have to be paid for with physical memory; the fragmentation tax can be paid with excess virtual address space instead. Another peculiarity is that, unlike other GCs, ZGC distinguishes between three size categories of objects: small, medium, and large. Each region contains objects of only a single size category, and each size category has a different object alignment. The differing object alignments allow ZGC to compress certain GC-internal data structures, but they are a challenge in terms of fitting into the existing CDS archived object format. These are the reasons why ZGC does not yet have CDS object archiving support.

      Field Layout Impact

      For the most part, field layout is computed independently of GC selection. However, if compressed pointers (-XX:+UseCompressedOops) are used, either explicitly or implicitly based on the maximum heap size at dump time, the size of object reference fields changes from 64 bits to 32 bits, which in turn may cause the field layout to be entirely different. This is why two CDS archives are shipped with the JDK: one for deployments using compressed pointers, and one for deployments not using them. The proposed solution does not aim to change this. Solving that problem would be considerably more involved, for several reasons; for example, field offsets may have been exposed to, and relied upon by, Unsafe, reflection, and method/var handles, and the offsets are also embedded in the metadata of the archive. Similarly, -XX:+UseCompressedClassPointers affects the field layout. There is currently no obvious reason not to use compressed class pointers, so we assume they are enabled when archiving objects.

      Object Reference Layout

      For the GCs that currently support object archiving, there are three different encoding variations for pointer compression (-XX:+UseCompressedOops), depending on the object alignment (-XX:ObjectAlignmentInBytes) and heap size (-Xmx). When the heap is small enough to fit in the low 4 GB of the virtual address space, a raw pointer encoding is used. When it does not fit, the low-order bits that are redundant given the selected object alignment can be shifted away. A third variation makes the pointers relative to the heap start rather than the start of the virtual address space.

      The current solution speculates that the particular encoding scheme used at dump time will also be used at deployment time, and patches the pointers if the speculation fails. A fourth variation, naturally, is raw pointers that are not compressed (-XX:-UseCompressedOops).

      When using ZGC, pointers are annotated with "color bits". With -XX:-ZGenerational these bits are stored in the high-order bits (a fifth variation), while with -XX:+ZGenerational they are stored in the low-order bits (a sixth variation).
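
      To make these variations concrete, the following is a minimal sketch (plain C++, not HotSpot code) of the encodings described above; the function names, the shift value, and the color-bit mask are illustrative assumptions only:

          #include <cstdint>

          // Compressed oops, raw encoding: the heap fits in the low 4 GB,
          // so the 32-bit value is simply the address.
          static uint32_t  encode_raw(uintptr_t addr) { return (uint32_t)addr; }
          static uintptr_t decode_raw(uint32_t v)      { return (uintptr_t)v; }

          // Compressed oops, zero-based shifted encoding: drop the bits made
          // redundant by object alignment (-XX:ObjectAlignmentInBytes=8 -> shift 3).
          static const int kShift = 3;
          static uint32_t  encode_shifted(uintptr_t addr) { return (uint32_t)(addr >> kShift); }
          static uintptr_t decode_shifted(uint32_t v)      { return (uintptr_t)v << kShift; }

          // Compressed oops, heap-based encoding: shifted offset relative to the heap base.
          static uint32_t encode_based(uintptr_t addr, uintptr_t heap_base) {
            return (uint32_t)((addr - heap_base) >> kShift);
          }
          static uintptr_t decode_based(uint32_t v, uintptr_t heap_base) {
            return heap_base + ((uintptr_t)v << kShift);
          }

          // Uncompressed oops are plain 64-bit addresses (the fourth variation).

          // ZGC colored pointers keep metadata ("color") bits in the pointer itself,
          // in the high-order bits (-XX:-ZGenerational) or the low-order bits
          // (-XX:+ZGenerational); the color must be masked away before the address
          // can be used.
          static const uintptr_t kLowColorMask = 0xF; // illustrative width only
          static uintptr_t strip_low_color(uintptr_t colored) { return colored & ~kLowColorMask; }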

      Summary

      As described above, there are various factors that can affect the bit pattern of how objects are represented in memory.

      • There are currently six different pointer formats in HotSpot
      • There are several different heap layouts: contiguous, region based, discontiguous
      • Object alignment differs depending on object size for different GCs
      • Object location differs depending on object size for different GCs

      It is inherent that different GC implementation strategies yield rather different layout policies. Therefore, it makes sense to move the current off-line layout decisions for CDS archived objects to on-line decisions made by the GC selected at deployment time.

      Description

      This JEP proposes a new object archiving format and loader that does not depend on which GC is used. It can be explicitly selected at dump time with -XX:+DumpStreamableObjects. When loading objects dumped with this mechanism, the GC owns object placement in the Java heap, and archived objects are allocated, initialized, and linked one by one from a GC-agnostic object stream in the CDS archive. Loading objects in this way is referred to as "object streaming".
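
      For illustration, creating and then using such an archive might look as follows; the archive name, class path, and main class are placeholders, and the exact combination of the new flag with the existing CDS dump workflows is an assumption rather than a finalized interface:

          java -XX:ArchiveClassesAtExit=app.jsa -XX:+DumpStreamableObjects -cp app.jar com.example.Main
          java -XX:SharedArchiveFile=app.jsa -XX:+UseZGC -cp app.jar com.example.Main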

      The archived objects in CDS are structured as a set of roots, each with a graph of objects underneath. The JVM requests root objects, and expects that, at the point a root object is requested, all objects transitively reachable from it have been materialized.

      In a way, the problem of traversing such object graphs efficiently while hiding the effect from the application is in spirit rather similar to the problem of performing tracing GC. Something ZGC has employed with great success is performing GC work concurrently with the application, and this JEP proposes to do precisely that: process the transitive closure of each root concurrently with the application. Loading of roots can therefore be done lazily, and the bulk of the work can be done in an extra CDS thread. Lazy loading is triggered when a JVM subsystem asks for a particular root object from the archived heap, for example when a class is loaded and its initial Class object is requested. The transitive closure of that root object is then materialized.

      Traversal Optimizations

      The roots are traversed with depth-first search (DFS). Since the concurrent CDS thread performs a DFS traversal through all roots in sequence, this traversal order is worth optimizing for. By laying out the object stream of the archive in DFS order from the beginning, the CDS thread can walk linearly through the objects, yielding the same traversal order as if a stack had been used to push and pop references in a proper DFS traversal. Besides the obvious locality improvement and avoiding the stack, this pre-ordering of the objects makes it possible to split the archived objects into three distinct sections:

      1. Objects already transitively materialized by the CDS thread
      2. Objects currently being materialized by the CDS thread
      3. Objects not yet processed nor concurrently accessed by the CDS thread

      This well-defined split allows the CDS thread to perform the bulk of its work without interfering with the bootstrapping thread. When the bootstrapping thread lazily loads a root that falls in the not-yet-materialized section, an explicit DFS traversal is started for that particular root. During this traversal, most of the work can be done independently of the concurrent materialization by the CDS thread; only when encountering objects in section 2 is any synchronization needed, which happens quite rarely in practice. When encountering concurrently materializing objects, we wait for the CDS thread to finish materializing them; the CDS thread uses an optimized traversal and will finish faster than the lazy materialization could. The section boundaries are shifted like a wavefront, atomically under a lock, but the bulk of the work is done outside of the lock.

      What this pre-ordering buys is fast iterative traversal by the CDS thread, while allowing laziness and concurrency with a minimal amount of coordination. In this way the CDS thread removes the bulk of the work of materializing Java objects from the critical bootstrapping thread.
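
      The following is a minimal sketch (plain C++, not HotSpot code) of this coordination, under the assumption that objects are identified by their DFS object index; the type and function names are invented for illustration:

          #include <cstdint>
          #include <mutex>
          #include <condition_variable>

          // Because objects are laid out in DFS order, the three sections above are
          // three consecutive index ranges that the CDS thread shifts forward.
          struct StreamingState {
            std::mutex lock;
            std::condition_variable batch_done;
            uint32_t materialized_end = 1; // indices below this are fully materialized
            uint32_t in_progress_end  = 1; // [materialized_end, in_progress_end) is being worked on
          };

          enum class Action { UseExisting, MaterializeYourself };

          // Bootstrapping thread: decide how to handle object index `idx` during a
          // lazy DFS traversal of a root.
          Action classify(StreamingState& s, uint32_t idx) {
            std::unique_lock<std::mutex> g(s.lock);
            if (idx < s.materialized_end) {
              return Action::UseExisting;
            }
            if (idx < s.in_progress_end) {
              // Rare case: the CDS thread is materializing this object right now;
              // wait for it to finish the current batch.
              s.batch_done.wait(g, [&] { return idx < s.materialized_end; });
              return Action::UseExisting;
            }
            return Action::MaterializeYourself; // untouched section; no coordination needed
          }

          // CDS thread: shift the wavefront forward under the lock after a batch;
          // the bulk of the materialization work happens outside the lock.
          void advance(StreamingState& s, uint32_t new_materialized_end, uint32_t new_in_progress_end) {
            std::lock_guard<std::mutex> g(s.lock);
            s.materialized_end = new_materialized_end;
            s.in_progress_end  = new_in_progress_end;
            s.batch_done.notify_all();
          }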

      Object Reference Format

      The object format of the archived heap is similar in its payload to a normal object. The only GC-specific part of the object layout is the object reference layout. Therefore, object references are encoded as DFS indices, which map to the position of the object in the buffer, as the objects are laid out in DFS order. This number is referred to as the "object index" in the archive. Object indices start at 1 for the first object, and the number 0 conveniently represents null. The object index is the core identifier of an object in this approach. These indices lend themselves perfectly to optimized table lookups, as a table can be implemented as a simple array. There is one such table mapping object indices to materialized Java heap objects, and another mapping object indices to the buffer offsets of the corresponding archived objects.
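
      A minimal sketch (plain C++, not HotSpot code) of these two tables and of reference resolution, with invented names, might look like this:

          #include <cstdint>
          #include <cstddef>
          #include <vector>

          // References in the archive are stored as DFS object indices; index 0 is null.
          struct ArchiveTables {
            std::vector<size_t> buffer_offset; // object index -> offset of the archived object
            std::vector<void*>  heap_object;   // object index -> materialized heap object
                                               // (slot 0 stays nullptr, representing null)
          };

          // Resolving an archived reference field is a plain array lookup.
          inline void* resolve(const ArchiveTables& t, uint32_t object_index) {
            return t.heap_object[object_index]; // index 0 naturally yields nullptr
          }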

      Early vs Late Object Materialization

      When loaders start materializing objects, the JVM has not yet reached the point in its bootstrapping sequence where GC is allowed. During this early phase, we apply more aggressive optimizations that exploit the fact that there is no GC to coordinate with yet.

      The table mapping object indices to Java objects contains raw object addresses before GC is allowed; as GC is enabled during bootstrapping, all raw addresses are replaced by handles stored as standard global roots of the JVM. All handles are handed back from the CDS thread when materialization has finished. The switch from raw addresses to handles happens under a lock, while no iteration or tracing is allowed. This allows early materialization to execute faster. The table also serves a dual purpose: it keeps track of all visited objects across all DFS traversals.

      Initialization is also performed in a faster way during early object materialization. In particular, before GC is allowed, we perform a raw memory copy of the archived object into the Java heap, followed by linking of its object references. The assumption is that before any GC activity is allowed, there is no need to worry about concurrent GC threads scanning the memory and being surprised to find objects whose references are momentarily invalid. Once GC is enabled, we revert to a more careful approach that uses a pre-computed bitmap to find where the object references are, copies only the non-reference data with raw memory copying, and sets the references separately with the appropriate GC barriers to cope with GC activity.
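
      The following is a minimal sketch (plain C++, not HotSpot code) of the two initialization paths, assuming the archived payload has the same field layout as the runtime object, stores a 32-bit object index in each reference slot, and lists reference offsets in ascending order; the names, the reference field size, and the barrier stand-in are assumptions:

          #include <cstdint>
          #include <cstddef>
          #include <cstring>
          #include <vector>

          struct ArchivedObject {
            const char*         payload;     // raw bytes in the archive buffer
            size_t              size;        // object size in bytes
            std::vector<size_t> oop_offsets; // reference field offsets (from a precomputed
                                             // bitmap), assumed sorted ascending
          };

          static const size_t kRefFieldSize = sizeof(void*); // assumes uncompressed oops

          // Stand-in for a GC-barrier-aware reference store; a real implementation
          // would go through the selected GC's store barrier.
          static void store_reference_with_barrier(void* obj, size_t off, void* value) {
            *(void**)((char*)obj + off) = value;
          }

          // Early path: GC is not allowed yet, so no other thread can observe the
          // object; a raw copy followed by in-place patching of references is safe.
          void initialize_before_gc_enabled(void* heap_obj, const ArchivedObject& a,
                                            void* (*resolve_ref)(uint32_t)) {
            std::memcpy(heap_obj, a.payload, a.size);
            for (size_t off : a.oop_offsets) {
              uint32_t idx = *(const uint32_t*)(a.payload + off); // archived object index
              *(void**)((char*)heap_obj + off) = resolve_ref(idx);
            }
          }

          // Late path: GC may be active, so copy only the non-reference bytes with
          // raw copying and store each reference through the barrier.
          void initialize_after_gc_enabled(void* heap_obj, const ArchivedObject& a,
                                           void* (*resolve_ref)(uint32_t)) {
            size_t prev = 0;
            for (size_t off : a.oop_offsets) {
              std::memcpy((char*)heap_obj + prev, a.payload + prev, off - prev);
              uint32_t idx = *(const uint32_t*)(a.payload + off);
              store_reference_with_barrier(heap_obj, off, resolve_ref(idx));
              prev = off + kRefFieldSize;
            }
            std::memcpy((char*)heap_obj + prev, a.payload + prev, a.size - prev);
          }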

      As for the background loading performed by the CDS thread, materializing too much too early and running out of memory would result in a JVM error and shut the process down. To deal with this, the CDS thread asks the GC for a budget of bytes it is allowed to allocate before GC is allowed. If this threshold is reached, the CDS thread has to wait until bootstrapping has reached the point where GC is allowed before it continues materializing objects.
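
      A minimal sketch (plain C++, not HotSpot code) of such a budget, with invented names, could look as follows:

          #include <cstddef>
          #include <mutex>
          #include <condition_variable>

          // Before GC is allowed, the background CDS thread may only allocate up to
          // a budget handed out by the GC; beyond that it must wait for GC to be enabled.
          class EarlyAllocationBudget {
            std::mutex lock;
            std::condition_variable gc_allowed_cv;
            size_t remaining;        // bytes the CDS thread may still allocate early
            bool gc_allowed = false; // flipped once bootstrapping enables GC
          public:
            explicit EarlyAllocationBudget(size_t budget) : remaining(budget) {}

            // Returns once it is safe to allocate `bytes`: either the early budget
            // still covers it, or GC has been enabled and normal allocation applies.
            void reserve(size_t bytes) {
              std::unique_lock<std::mutex> g(lock);
              if (!gc_allowed && bytes <= remaining) { remaining -= bytes; return; }
              gc_allowed_cv.wait(g, [&] { return gc_allowed; });
            }

            // Called by the bootstrapping thread once GC becomes allowed.
            void enable_gc() {
              std::lock_guard<std::mutex> g(lock);
              gc_allowed = true;
              gc_allowed_cv.notify_all();
            }
          };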

      When bootstrapping reaches the point where GC is allowed, we resume materializing the objects that did not fit in the budget. Before the application is allowed to run, we force materialization of any remaining objects that the CDS thread has not yet materialized, so that the running program does not encounter surprising OutOfMemoryErrors caused by object materialization.

      Object Linking

      The table mapping object indices to Java heap objects is filled in when an object is allocated. Materializing an object involves allocating it, initializing it, and linking it to other objects. Since linking an object requires that the objects reachable through its reference fields are at least allocated, the iterative traversal of the CDS thread first allocates all of the objects in the zone being worked on, that is, all not-yet-materialized objects transitively reachable from the currently processed root. When all objects of the current batch have been allocated, initialization and linking are performed in a second pass. The lazy tracing materialization links objects when popping an entry from the DFS stack; in that context, the address of the field referencing the object about to be materialized can be computed, and that field can be updated to point at the materialized object.
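
      The two-pass structure used by the CDS thread could be sketched as follows (plain C++, not HotSpot code; the callbacks and the batch representation are invented for illustration):

          #include <cstdint>

          struct Batch {
            uint32_t first_index; // first DFS object index in the batch
            uint32_t last_index;  // last DFS object index in the batch
          };

          void materialize_batch(const Batch& b,
                                 void* (*allocate)(uint32_t idx),
                                 void  (*init_and_link)(uint32_t idx)) {
            // Pass 1: allocation only; fills the index -> heap object table.
            for (uint32_t i = b.first_index; i <= b.last_index; i++) {
              allocate(i);
            }
            // Pass 2: initialization and linking; every reference target is now
            // guaranteed to be at least allocated, so fields can be patched directly.
            for (uint32_t i = b.first_index; i <= b.last_index; i++) {
              init_and_link(i);
            }
          }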

      One interesting benefit of object-level linking is that we can better deal with the off-line snapshot of objects being mixed with on-line objects from the deployment run. For example, the current direct-mapping-based object loader dumps the entire String Table. This requires an array that keeps track of all string objects from the string table; it is sometimes encoded as an array of arrays, because the one-dimensional array might become large enough to be a humongous object for G1, which is not supported. What dumping the string table buys us is keeping track of a boolean identity property of certain string objects: whether or not they were interned. In the streaming approach, we do not need to dump the entire string table. Instead, strings in the archive that were interned have a bit set in a bitmap, representing this identity property. When linking interned strings, we intern the string dynamically, which may yield a link to an off-line archived object or to an on-line interned string from the deploy-time JVM.
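
      A minimal sketch (plain C++, not HotSpot code) of interned-string linking, using an ordinary hash map as a stand-in for the JVM's string table and invented names throughout, might look like this:

          #include <cstdint>
          #include <string>
          #include <unordered_map>
          #include <vector>

          struct StringLinker {
            std::vector<bool> interned_bit;                      // indexed by object index
            std::unordered_map<std::string, void*> string_table; // stand-in for the JVM table

            // Returns the object the referencing field should be linked to.
            void* link_string(uint32_t object_index, void* materialized,
                              const std::string& value) {
              if (!interned_bit[object_index]) {
                return materialized;       // ordinary string, use the archived copy as-is
              }
              auto it = string_table.find(value);
              if (it != string_table.end()) {
                return it->second;         // an equal string was already interned on-line
              }
              string_table.emplace(value, materialized);
              return materialized;         // the archived string becomes the canonical one
            }
          };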

      Scalability

      The streaming approach processes objects one by one, rather than mapping memory from a file straight into the Java heap. It is worth discussing the scalability implications of that.

      In a cold start, there is a cost per byte anyway, so the implications are similar for both solutions. If we want fast cold starts, we should probably strive to keep the archive small. Since the streaming approach embraces that there will be work per object, and aims instead at offloading that cost from critical bootstrapping, it hides the per-byte cost of a cold start better. It also lends itself more naturally to compression, since decompression can similarly be offloaded. Cold starts should ultimately benefit from a smaller artefact size.

      As for warm starts, it is worth noting that the size of heap archives does not grow linearly with the number of classes, as many resolved references refer to already captured strings for various common types, and so on. This currently makes it difficult to produce large heap archives in the first place. Because of this, as applications grow larger, other overheads are in general much larger than object streaming.

      Should this eventually change, and processing large archives become important, the approach has been designed to allow parallelization in the future. That would at least allow deployments with available CPU resources to process the objects faster, if concurrency alone is insufficient. As for CPU-constrained environments running large applications, the default would currently pick the mapping solution; determining whether the GC-agnostic solution works well enough in such situations is outside the scope of this JEP.

      Alternatives

      When implementing support for ZGC, it is not strictly necessary to build a GC-agnostic solution. One possible solution would be to double down on GC-specific logic and build a dedicated ZGC object loader that lays out objects with the heap layout and pointer layout expected by ZGC. This has some notable disadvantages:

      • The default CDS archives shipped with the JDK would have to contain duplicate information for the extra object archive required by ZGC, inflating the size of the JDK unnecessarily compared to a GC-agnostic solution.
      • Development of ZGC would be slowed down and complicated by entangling GC implementation details with how objects are dumped to the CDS archive.

      As for the advantages of doubling down on ZGC-specific object dumping logic, they are unclear. Presumably, the main advantage would be starting the JVM faster. However, current experiments indicate that the streaming loader is very efficient without introducing any ZGC-specific knowledge.

      As for GC-agnostic object archiving, different approaches have been considered. Most of them involved materializing all objects at once, very early, without any laziness. This led to trouble when running with very small heap sizes: GCs would want to perform a collection once a significant part of the heap had been allocated, but the JVM is not yet in a state where it can perform GCs that early. Allowing laziness therefore makes the mechanism more GC-agnostic.

      Testing

      A large number of CDS tests, including tests for object archiving, have already been written. They will be adapted to regularly test with ZGC and the new object streaming approach.

      Risks and Assumptions

      Since the bulk of the work of linking at object granularity runs in an extra CDS thread, there is an assumption that it is acceptable for both the bootstrapping thread and the CDS thread to run at the same time. Some constrained cloud environments might not be willing to give the JVM that extra core, even for a short period of time, which would result in slightly delayed startup. Having said that, using a concurrent GC such as ZGC in such a constrained environment is generally not going to work very well either. It is also expected that cold startup time is more interesting than hot startup time in such environments. The streaming approach lends itself naturally to applying compression to the object payload, since the off-line and on-line object formats are designed to be different, which is likely beneficial for such environments.

      There is another risk: memory footprint. The existing heap archiving solution maps the archived objects straight into the Java heap, while the streaming approach loads the heap archive into a temporary location in memory and materializes objects from there into the Java heap. Therefore, during bootstrapping, the footprint of the archived heap is higher due to this duplication. However, when plotting typical memory usage over time, the usage during bootstrapping is typically far below the eventual memory footprint of the running application. Hence, there will only be a footprint regression if the application never needs more memory (Java heap, native memory, code cache, etc.) than the size of the archived objects, which seems rather unlikely.
