# Summary
This is a proposal to introduce Automatic Heap Sizing (AHS) to the Java Serial Garbage Collector. Adding AHS will enable the Java heap to adapt based on changes in workload and memory pressure. The proposal draws heavily from OpenJDK bug report [https://bugs.openjdk.org/browse/JDK-8329758](https://bugs.openjdk.org/browse/JDK-8329758) that proposes adding AHS to ZGC. This bug report should likely be the precursor to a JEP as this body of work is large enough to require one.
## Goals
When using Serial GC:
- By default, allow the heap to safely utilize as much memory as is needed.
- Ensure that the JVM is a good citizen by returning memory if:
  - Memory isn’t needed
  - There is low availability of system memory
- Be able to dynamically adapt the heap size in response to unpredictable circumstances.
- Be able to dynamically rebalance generational spaces to adapt to changes in the data life-cycle.
## Non-Goals
It is not a goal of this bug report to:
- Maintain an optimal heap size.
- Override the existing ability to set static heap bounds with the existing heap-sizing JVM options.
## Motivation
The conditions for the JVM to ergonomically select the Serial collector at startup are that the JVM has access to fewer than two CPU cores or less than 1792MB of memory. It is common for JVMs deployed into these constrained environments to run without any user-supplied configuration. This includes not specifying the heap size, the most notable configuration users are expected to supply. This leaves the Java heap to be configured using defaults which, historically, are very often suboptimal. Suboptimal configurations inflate GC overhead, which negatively impacts tail latencies and deployment costs. The reason for this lack of user-supplied configuration is that selecting a good heap size is notoriously difficult. The range of performant heap sizes for an application depends on many technical details across the entire technology stack, such as:
- **Actors**: The aggregate sum of all work that end users and other systems place on the application.
- **The application**: The amount of memory needed to persist data as well as the amount of working memory needed for each incoming request.
- **The libraries**: When libraries use direct mapped byte buffers or spin up threads, there is less memory available for the heap. Calculating this upfront is notoriously difficult, if not impossible. This is especially true when running in containers.
- **The JVM**: The JVM needs to use memory resources for other things than the heap, such as GC metadata, metaspace, code cache, JIT threads, ZIP buffers, etc.
- **The other processes**: The application’s memory needs must be balanced against the needs of other processes.
- **The OS**: Policies around when to start memory compression or swapping.
- **The hardware**: Memory availability has a big impact. The time to perform GC depends on CPU availability, memory bandwidth, caching policies, atomic instruction implementation, etc. For example, a common Kubernetes deployment configuration is 1000 millicores. Even for the Serial collector, 1000 millicores will leave GC starved for CPU.
Why JVMs are not being configured, given the impact that heap configuration has on latency, isn’t well known. Perhaps deployers deploy without any configuration with the (reasonable?) expectation that the JVM will “just work”. That said, even if deployers engage in some experimental process to discover an optimal configuration, that configuration will be static. Consequently, it’s unlikely that the JVM will be able to adapt to all conditions that are typically found in large homogeneous deployments. For example, there are always harmonics at work that often result in individual JVMs requiring different configurations. It is impossible to know, in advance, which JVMs will need which configuration, even when they are running the same code on the same hardware. This gets more complicated when a heterogeneous set of applications is deployed on the same server.
Some of the conditions that upset GC performance include:
- Unpredictable or expected bursts in the workload.
- Unpredictable or expected changes in allocation patterns due to modal shifts or one-off events.
- Unpredictable or expected change in data life-cycles due to modal shifts or one-off events.
- Unpredictable application profiles after upgrading software (application, libraries, JVM, OS, etc.).
- Unpredictable memory usage due to direct mapped byte buffers.
- Unpredictable memory usage due to JVM implementation details (GC metadata, metaspace, code cache, thread stacks).
- Unpredictable memory usage due to other processes.
- Unpredictable proactive OS memory compression policies compressing garbage instead of collecting it.
- Unpredictable memory usage due to fragmentation of JVM internal memory slowly increasing over a long period of time.
- Unpredictable memory layouts at startup.
Finding a static heap size for all possible dynamic circumstances is not possible. Managing these situations requires taking the costly step of keeping extra memory resources in reserve. Even then, any statically defined configuration will be suboptimal for cases other than the primary modality of the application. As the process for tuning the Serial collector is well known and the runtime has the metrics needed to drive it, the best solution is to enhance the JVM’s ability to automatically adjust the Java heap to minimize the effects of GC interference on application performance.
## Description
Following in the footsteps of AHS for ZGC, this document proposes an Automatic Heap Sizing policy for the Serial collector. This policy will automatically find a heap size that minimizes GC interference by dynamically adapting to changing circumstances on the server. It selects heap sizes within the minimum and maximum size boundaries, which users can still set as usual with the `-Xms` and `-Xmx` command-line options. However, the default maximum and minimum heap sizes will be changed when using Serial GC to give the automatic heap sizing as much flexibility as possible by default. The changes will be as follows:
- Default static minimum and initial heap sizes (`-Xms`) are changed to 16MB.
- Default static maximum heap size (`-Xmx`) is changed to 100% of the computer’s available RAM or the container (cgroup) memory limit.
- A new dynamic maximum heap size dynamically adapts to changes in memory availability of the computer.
- The default aggressiveness of the GC, which will affect the heap size, can be controlled by the JVM flag `-XX:GCTimeRatio` (see the note following this list). Under the conventional interpretation of this flag, higher values demand a lower GC overhead and therefore a larger heap, whereas lower values tolerate more GC overhead and allow for a smaller heap. The default value tries to strike a reasonable balance between MMU (minimum mutator utilization) and memory footprint. The flag is manageable, meaning that it may be updated at runtime if desired.
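For reference, HotSpot has conventionally mapped `GCTimeRatio` to a target GC overhead as shown below; whether this proposal keeps exactly this interpretation for Serial AHS is a detail to be settled.

$$\text{target GC overhead} \approx \frac{1}{1 + \texttt{GCTimeRatio}}$$

With the long-standing default of `GCTimeRatio=99`, the target is roughly 1% of total time spent in GC, which matches the 1% target used in the Eden sizing example later in this document.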
### Automated Sizing
Java heap for the Serial collector is generational: it is subdivided into a Young and an Old generational space. Young is further subdivided into Eden and two Survivor spaces. Each of these spaces serves a different role. These roles are:
- **Eden**: Used for the initial allocation of data into Java heap.
- **Survivor**: The active survivor space retains data that isn’t yet old enough to be promoted to Tenured.
- **Tenured**: Retains long-lived objects (data).
As each space has its own distinct role in Java heap, it follows that each space should be tuned so that it best supports that role.
Automated Tuning needs to consider overall global memory pressure. The table below outlines the conditions and expected reactions.
| Memory Pressure | GC Overhead | Action on heap size |
|-----------------|----------------------|---------------------|
| High | Greater than threshold | Contract |
| High | At threshold | Contract |
| High | Less than threshold | Contract |
| Moderate | Greater than threshold | Expand |
| Moderate | At threshold | Maintain |
| Moderate | Less than threshold | Contract |
| Low | Greater than threshold | Expand |
| Low | At threshold | Maintain |
| Low | Less than threshold | Contract |
As can be seen in the table, if global memory pressure is high then the action should be to return memory to the OS. Otherwise, GC overhead drives the decision to expand, contract, or maintain the size. At issue is that simply expanding or reducing the Java heap may leave the application starved for memory in one of the heap spaces. For example, it is common that shrinking the heap leaves Survivor spaces undersized. The consequence of this is premature promotion and, with that, an increase in Full GC cycles. More Full GC cycles will inflate GC overhead. It is often the case that right-sizing Eden, Survivor, and Tenured according to their roles will not only lower GC overhead, but also result in a smaller heap. Thus, rather than focusing on total heap size, the more effective tuning strategy is to focus on the needs of Eden, Survivor, and Tenured. Using this strategy, total heap size is the sum of the individual spaces instead of the spaces being a fraction of the total. The following sections describe the metrics and tuning strategies needed for each space.
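A minimal sketch of the decision logic in the table, assuming a target overhead derived from `GCTimeRatio`; it is illustrative only and not the proposed implementation:

```java
// Illustrative sketch of the decision table above; the enum and parameter names are
// assumptions, and "at threshold" would in practice be a tolerance band rather than
// exact equality.
enum MemoryPressure { LOW, MODERATE, HIGH }
enum Action { EXPAND, MAINTAIN, CONTRACT }

final class ResizeDecision {
    static Action decide(MemoryPressure pressure, double gcOverhead, double targetOverhead) {
        if (pressure == MemoryPressure.HIGH) {
            return Action.CONTRACT;              // high memory pressure: return memory to the OS
        }
        if (gcOverhead > targetOverhead) {
            return Action.EXPAND;                // GC is over budget: give it more room
        }
        if (gcOverhead < targetOverhead) {
            return Action.CONTRACT;              // GC is under budget: give memory back
        }
        return Action.MAINTAIN;                  // at the target: leave the heap size alone
    }
}
```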
### Sizing Eden
Under normal conditions, mutator threads allocate objects into Eden. A young GC cycle occurs when a mutator experiences an allocation failure in Eden and the conditions for a full collection have not been met. The fraction of time spent running GC cycles is known as GC overhead: total GC overhead is the total time spent garbage collecting divided by the total application runtime. The two factors that affect GC overhead are GC frequency and GC duration. From the statements above, we can deduce that GC frequency is a function of the size of Eden and the allocation rate. Consequently, these are two of the levers that can be pulled to influence GC overhead, and of the two it is the size of Eden that we have direct control over. Thus, the proposal here is to moderate GC overhead by manipulating the size of Eden. The following is an example of the calculations involved. Note that the assumption made is that altering the size of Eden has no impact on GC duration. In reality there is a complex relationship between GC frequency and pause duration, but for the purposes of this exercise the complexity can be ignored because, for stable heaps, pause times for fully copying collectors are roughly constant. While the Serial young collection is a copying collection and the full collection is an in-place collection, the banding of young-gen pause times around a median pause time is commonly observed.

    Let Eden            = 500MB
    Let Allocation Rate = 50MB/second

    GC Interval  = Eden / Allocation Rate
                 = 500MB / 50MB/second
                 = 10 seconds
    GC Frequency = 6 cycles/minute

    Let window      = time for 60 GC cycles to run
                    = 10 minutes (for this example)
    Let GC duration = average GC duration over the window
                    = 200 ms (for this example)

    Total GC Time = window * GC Frequency * GC duration
                  = 10 minutes * 6 cycles/minute * 200 ms
                  = 12000 ms (12 seconds of GC pause time)

    GC Overhead = Total GC Time / window
                = 12 seconds / 600 seconds
                = 2% GC overhead

GC Overhead is 2% or application throughput is 98%.
If the target GC overhead is 1%, then we can calculate that, all other things being equal, Eden should be expanded to 1000MB to reach the target. In reality, reducing GC frequency tends to lead to higher allocation rates. Unfortunately, the overall impact of this effect isn’t predictable; however, pause times will tend to band around a median value. The other notable point is that it generally takes several rounds of tuning to stabilize GC overhead at an acceptable level.
One of the issues with the calculations above is that it’s assumed that the allocation rate is constant. In reality, allocation rates are a function of the pressure actors are placing on the application. Given that this pressure will fluctuate, allocation rates will fluctuate. These fluctuations generally happen in such a small range that there isn’t any advantage to reacting to them. Some form of dampening will need to be employed to prevent the heap from constantly resizing. The recommendation here is to use `GCTimeRatio` to set the target GC overhead.
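Under the same simplifying assumptions as the worked example (constant allocation rate and constant pause duration), the arithmetic can be captured in a small sketch; the class and method names are illustrative only, not part of the proposed implementation:

```java
// Minimal sketch of the Eden sizing arithmetic above, under the stated assumptions.
// Since GC overhead ~ (average pause * allocation rate) / Eden size, a target overhead
// implies a desired Eden size. Names are illustrative, not proposed APIs.
public final class EdenSizing {
    static long desiredEdenBytes(double allocBytesPerSec, double avgPauseSec, double targetOverhead) {
        return (long) (allocBytesPerSec * avgPauseSec / targetOverhead);
    }

    public static void main(String[] args) {
        double mb = 1024.0 * 1024.0;
        double allocRate = 50 * mb;   // 50 MB/second, as in the example
        double avgPause  = 0.200;     // 200 ms average young pause
        System.out.printf("2%% target -> Eden ~ %.0f MB%n", desiredEdenBytes(allocRate, avgPause, 0.02) / mb);
        System.out.printf("1%% target -> Eden ~ %.0f MB%n", desiredEdenBytes(allocRate, avgPause, 0.01) / mb);
    }
}
```

With the numbers from the worked example, this reproduces the 500MB Eden at a 2% target and the 1000MB Eden needed for a 1% target.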
### Sizing Survivor
The role of Survivor is to retain objects that are longer-lived transients. It needs to be sized large enough to retain all the recently allocated objects that have survived a collection. The number of GC cycles that an object survives is known as the age of the object. When the age of an object meets the tenuring threshold, it is copied to Tenured. The max tenuring threshold in the JVM is 15.
The age table tracks the volume of data at each age. This data is used to help estimate how much data will remain in Survivor at the end of the next collection. If the amount of data that is expected to survive a collection exceeds the size of the Survivor space, then the tenuring threshold is rolled back to the point where the volume of data that will be retained will fit into the Survivor space. This condition is known as premature promotion.
There are several things that happen when data is prematurely promoted. The most damaging of these is that premature promotion increases the frequency of Full collections, which inflates GC overhead. Premature promotion also makes zombiism more likely. A zombie is a dead object in Tenured that must nevertheless be treated as live until the next Full collection. Zombiism causes more data to be promoted, which in turn further increases Full GC frequency which, of course, further inflates GC overhead.
There is yet another form of premature promotion known as hidden premature promotion. Hidden premature promotion occurs when the age of transient data reaches the tenuring threshold. The general cause is accelerated aging due to a high Young GC frequency. While the optimal solution is to reduce the allocation rate, the solution that can be implemented here is to ensure that both the Survivor and Eden spaces are adequately sized so that objects are not prematurely promoted. In addition to looking at how the tenuring threshold behaves, the data in the age table can be used to perform the necessary analysis.
The data in the age table represents the trailing edge of the curve described by the Weak Generational Hypothesis. It is this hypothesis, which states that most objects die young, that motivates the use of generational spaces. The age table bears this out, as it almost always shows a large recovery of memory at age 0, a slightly smaller recovery at age 1, a smaller recovery at age 2, and so on. This recovery continues until, at some age, only long-lived data remains. It is at this age that data should be tenured, as this reduces copy costs. The tenuring threshold should be set to this age because it maximizes heap recovery, minimizes copy costs, and also minimizes the promotion of transients to Tenured space.
A more detailed explanation follows this brief description of the steps to tune survivor.
1. Calculate a desired survivor occupancy that is large enough to minimize premature promotion but no larger.
2. Calculate the tenuring threshold so that it balances the recovery of transients vs the copying costs for long-lived data.
The following is a sample analysis of the Age table taken from production data.
- Max Tenuring Threshold = 15
- Desired Survivor Occupancy = 67,108,864 bytes
- Number of collection cycles = 1384
- Number of collections that prematurely promoted = 856
- Premature promotion rate = 61.8% (856 of 1384 collections)
| Age | Average Volume (B) | Max Volume (B) |
|-----|--------------------|----------------|
| 1 | 48944511 | 189296194 |
| 2 | 21333235 | 64342000 |
| 3 | 6156835 | 53250208 |
| 4 | 1090482 | 45645728 |
| 5 | 269845 | 43009496 |
| 6 | 65743 | 43009496 |
| 7 | 11744 | 5973464 |
| 8 | 2791 | 3807256 |
| 9 | 0 | 0 |
| 10 | 0 | 0 |
| 11 | 0 | 0 |
| 12 | 0 | 0 |
| 13 | 0 | 0 |
| 14 | 0 | 0 |
| 15 | 0 | 0 |
The average total occupancy of Survivor is 77,875,176 bytes, which is larger than the desired survivor occupancy of 67,108,864 bytes. This is an indication that the premature promotion problem is severe and, consequently, that the Survivor space needs to be aggressively enlarged. In these cases, a 2x enlargement would be recommended. The goal is to keep increasing the size of Survivor until there is little to no premature promotion. At that point, the desired survivor occupancy can be calculated by summing the 90th-percentile occupancies for each age. Note that the desired survivor occupancy is, by default, 50% of the total survivor space size. This desired survivor occupancy is user configurable and is sometimes set as high as 90%. Not only does this configuration need to be respected, it also needs to be known because it affects the sizing calculations.
The value for the tenuring threshold should maximize the recovery of transients while minimizing overall copy costs. To do this, the tenuring threshold should be set at the age in Survivor where little to no additional memory is being recovered. Algorithmically, this happens when the slope of the data points representing the volume of data is at, or close to, 0. In the table above, a tenuring threshold of 6 meets this definition. The complication is that the data in the age table is distorted by the rate of premature promotion, which impedes the ability to perform this analysis. When that is the case, the tenuring threshold should be set to `MaxTenuringThreshold`.
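A sketch of this age-table analysis, under the assumption that per-age volumes and per-age 90th-percentile occupancies are available; the flatness cutoff and all names are illustrative choices, not the proposed heuristic:

```java
// Illustrative sketch of the age-table analysis described above. Input shapes, names,
// and the flatness cutoff are assumptions made for readability.
final class SurvivorSizing {
    /**
     * volumeByAge[i] holds the bytes surviving at age i + 1 (as in the table above).
     * Returns the first age at which the drop from the previous age becomes negligible,
     * i.e. where the slope of the recovery curve is close to zero.
     */
    static int pickTenuringThreshold(long[] volumeByAge, int maxTenuringThreshold, double flatnessFraction) {
        long cutoff = (long) (volumeByAge[0] * flatnessFraction);
        for (int i = 1; i < volumeByAge.length; i++) {
            long drop = volumeByAge[i - 1] - volumeByAge[i];
            if (drop <= cutoff) {
                return Math.min(i + 1, maxTenuringThreshold);   // ages in the table are 1-based
            }
        }
        return maxTenuringThreshold; // curve never flattens, e.g. distorted by premature promotion
    }

    /** Desired survivor occupancy: the sum of the per-age 90th-percentile occupancies. */
    static long desiredSurvivorOccupancy(long[] p90OccupancyByAge) {
        long sum = 0;
        for (long bytes : p90OccupancyByAge) {
            sum += bytes;
        }
        return sum;
    }
}
```

With the average volumes from the table above and a cutoff of 1% of the age-1 volume, `pickTenuringThreshold` returns 6, matching the threshold identified in the text.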
### Sizing Tenured
The role of Tenured is to provide a space to hold long-lived objects. Under normal circumstances, only objects in Young that need to be promoted are allocated in Tenured, and this task is performed by the GC thread. Objects too large to be allocated in Eden (or Survivor) will be allocated directly into Tenured by a mutator thread; however, this should be a rare event. A full collection cycle is triggered when it is determined that there is no longer enough memory in Tenured to support the volume of data that is expected to be promoted. From this it can be deduced that Full GC frequency is a function of the size of Tenured and the promotion rate. Higher promotion rates equate to more frequent Full GC cycles, and that equates to much higher GC overhead. This is one of the reasons why it is important that the promotion of transients be minimized.
Tenured space should be large enough to hold the live set plus some extra to support future promotions. The live set size is defined as the volume of data that is consistently live after a Full GC.
Filling Tenured triggers a full collection. Similarly to what happens in Eden, the allocation rate into Tenured together with the size of Tenured determines Full GC frequency. While most allocations in Tenured are due to data being promoted, mutators will allocate directly into Tenured if the object being allocated is too big to fit into Eden (even when Eden is empty). However, this is a rare case which can be ignored for the purposes of this exercise.
Assuming that Young has been adequately tuned so that only a minimum of transients are being promoted, the only lever directly available to reduce GC overhead due to full collections is the size of Tenured. Thus, Tenured should be sized to support the live set plus some working space, likely between 50% and 100% of the live set size. Tenured should grow as needed to reduce Full GC frequency. This reduced Full GC frequency should help reduce GC overhead, helping the JVM meet the `GCTimeRatio` target. It should be sufficient to start with a recent median value for the live set size after a Full GC.
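A minimal sketch of this rule, assuming a recent median live-set estimate is tracked; the names are illustrative and the headroom factor of 1.5x to 2.0x corresponds to the 50% to 100% working space suggested above:

```java
// Illustrative only: Tenured sized to the recent median live set plus 50%-100% headroom.
final class TenuredSizing {
    static long desiredTenuredBytes(long medianLiveSetBytes, double headroomFactor) {
        // headroomFactor is expected to lie in [1.5, 2.0]
        return (long) (medianLiveSetBytes * headroomFactor);
    }
}
```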
## Sizing Java Heap
As can be seen from the above description, the overall size of the Java heap is the sum of the individual parts. Since each of these parts plays a different role in how GC functions, each has its own memory needs, and each of those needs is described by its own set of metrics. In the case where observed GC overhead is less than the `GCTimeRatio` target, heap should be returned to the system, using the calculations described above to rebalance the Eden, Survivor, and Tenured spaces.
The complicated case, which hasn’t been discussed, is how to manage the situation when observed GC overhead exceeds the `GCTimeRatio` target but there isn’t enough system memory to support heap enlargement. Current thinking is that if the Tenured space is too small for the live set, then an OutOfMemoryError will be thrown and the JVM will terminate. It is better to enlarge Tenured, even at the risk of undersizing Young, to avoid the OOME being thrown.
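For completeness, a sketch of how the total heap follows from the individually sized spaces under the Serial layout (Eden, two Survivor spaces, Tenured); this is simply the summation described above, with illustrative names:

```java
// Illustrative: the total heap is the sum of the individually sized spaces rather than
// the spaces being fractions of a predetermined total (Serial layout: Eden + 2 Survivors + Tenured).
final class HeapSizing {
    static long totalHeapBytes(long edenBytes, long survivorBytes, long tenuredBytes) {
        return edenBytes + 2 * survivorBytes + tenuredBytes;
    }
}
```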
## Rate of Expansion
The question here is: how fast should a memory space be expanded (or contracted)? The answer that will be explored first is that the space should be immediately resized to the size calculated by the resizing heuristic. The underlying principle is that a fire should be put out before it becomes too big to manage.
## Heap Committing
Similarly to the comments in the ZGC AHS JEP, committing and paging in memory can cause latency problems if it is performed by application threads. Currently, when the user sets `-Xms` and `-Xmx` to the same value, the Serial heap commits the memory upfront. Moreover, when a user specifies the `-XX:+AlwaysPreTouch` option, the heap memory is paged in before running main. There is a tradeoff between startup and warmup performance involved here. `AlwaysPreTouch` is disabled by default, which favors startup but reduces warmup performance. With the proposed defaults, users won’t benefit from committing memory upfront or paging in heap memory. The only caveat is that, by default, the Serial heap will start small, meaning the startup costs may not be significant.
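For illustration, the existing behaviour described above can still be requested explicitly; the application jar name is a placeholder. Setting `-Xms` equal to `-Xmx` commits the heap upfront, and `-XX:+AlwaysPreTouch` pages it in before `main` runs:

```
java -Xms512m -Xmx512m -XX:+AlwaysPreTouch -jar app.jar
```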
## Memory Pressure
With the tactics described thus far, a JVM process may automatically find an appropriate heap size, given a default GC pressure. However, it might be that multiple processes are running concurrently, and that if we let the JVM use as much memory as it desires, the computer will not have enough memory available to satisfy everyone.
This proposal advocates a mechanism similar to that proposed for ZGC for how resizing should act on a server running low on memory. A small portion of the computer's memory is treated as a reserve that we prefer not to use. The GC pressure of the automatic heap sizing heuristics is scaled by how much of that memory reserve is consumed. The memory usage of the computer is monitored continuously, and as the computer runs low on memory, the GC heuristics work harder to shrink the heap. Before the computer runs out of memory, the GC will be working very hard. As the memory reserve gets consumed, the memory pressure increases first linearly, and then exponentially.
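Purely as an illustration of “first linearly, then exponentially” (neither this proposal nor the ZGC work prescribes a specific curve here), a scaling function might look like the following; the constants are arbitrary:

```java
// Illustration only: scale GC pressure by how much of the memory reserve is consumed.
// reserveUsed is the fraction of the reserve in use, in [0, 1]; constants are arbitrary.
final class MemoryPressureScaling {
    static double gcPressureMultiplier(double reserveUsed) {
        double clamped = Math.max(0.0, Math.min(1.0, reserveUsed));
        double linearPart = 1.0 + clamped;                                     // gentle early ramp
        double exponentialPart = Math.exp(4.0 * Math.max(0.0, clamped - 0.5)); // dominates late
        return linearPart * exponentialPart;
    }
}
```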
Importantly, this mechanism gives all JVMs a unified view of how critical memory pressure is, which allows processes under the control of these heuristics to reach an equilibrium of GC pressure, as opposed to randomly reacting to reactions of other JVMs without a shared goal and strategy.
When using macOS or Windows with memory compression enabled, the ratio of used memory being compressed vs not compressed is continuously monitored. The perceived size of the memory reserve gets scaled according to said compression ratio. Consequently, when the OS starts compressing more memory, the GC will work harder to reclaim garbage and give memory back to the OS, relieving its compression pressure.
The max heap size is dynamically adapted to be at most the memory available on the computer, plus a small critical reserve of memory. Exceeding said threshold will result in OutOfMemoryError if the situation cannot be resolved in a timely manner.
## Generation Sizing
When updating the heap size, the distribution of memory between the young and old generations needs to be reconsidered. With the Serial collector, there is a hard boundary between the two generations. One way of easing this restriction is to reorder the generational spaces. This should allow the boundary to be reset after a Full collection.
## Alternatives
The alternative solution is to reset the JVM’s default 25% maximum heap size to a larger value. However, the consensus is that this solution isn’t safe, as it changes the memory behavior of a significant number of deployments that are not explicitly configured, thus increasing the risk of more frequent OOM Killer events. Additionally, with the current implementation, it is very likely that deployments that don’t need more than 25% would end up using a larger volume of real RAM, again increasing the risk of more frequent OOM Killer events.
## Testing
This enhancement primarily affects performance metrics. Therefore, it will be thoroughly tested with a wide variety of workloads. The defined success metrics will be tested on said workloads.
## Risks and Assumptions
By changing the default maximum heap size from 25% of the available memory to most of the available memory, there is a risk that the new heuristics use more memory than the current implementation would, and other processes run out of memory. However, with a 25% heap size policy and few heuristics to try to limit the heap size, there is already a risk of that happening when several JVMs with the default heap size run on the same computer. Moreover, the dynamically updated max heap size is very likely to be able to throw an OutOfMemoryError before exceeding the computer’s memory limits.