-
Bug
-
Resolution: Unresolved
-
P4
-
None
-
None
As currently implemented, a failed allocation request triggers a Full GC, and we throw OOM only if the allocation still fails after that Full GC.
This requires that we stop the world before throwing. Since a user who has chosen Shenandoah GC is presumably seeking to avoid long STW pauses, it is probably better to perform a concurrent GLOBAL GC and throw OOM only if we still cannot satisfy the allocation request after the GLOBAL GC completes.
In the case of Generational Shenandoah, a GLOBAL GC marks both old and young, and is followed by multiple mixed evacuations to reclaim all of the old-gen memory that might not fit within the initial evacuation effort.
Throwing OOM after concurrent GLOBAL GC gives the application a better opportunity to recover and resume proper pause-free operation, possibly discarding certain caches and/or rebalancing loads. STW Full GC is more disruptive to ongoing latency-sensitive operation of the workload.
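The proposed policy can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyHeap`, `tryAllocate`, `concurrentGlobalGc`, and `allocateOrThrow` are hypothetical names invented for illustration and do not correspond to any HotSpot code, and the "GC" here is a single bookkeeping step rather than a real concurrent cycle.

```java
/** Toy model of the proposed policy. None of these names exist in HotSpot. */
final class AllocationFailurePolicySketch {

    /** Hypothetical heap: tracks used bytes and how much of that is reclaimable garbage. */
    static final class ToyHeap {
        private final long capacity;
        private long used;
        private long garbage; // portion of `used` that a GC cycle could reclaim

        ToyHeap(long capacity, long used, long garbage) {
            this.capacity = capacity;
            this.used = used;
            this.garbage = garbage;
        }

        long used() {
            return used;
        }

        /** Succeeds only if the request fits in the currently free space. */
        boolean tryAllocate(long size) {
            if (capacity - used < size) {
                return false;
            }
            used += size;
            return true;
        }

        /**
         * Stands in for one concurrent GLOBAL cycle. It reclaims only a
         * percentage of the garbage, modeling the fact that the first
         * evacuation does not necessarily reclaim everything.
         */
        void concurrentGlobalGc(int reclaimPercent) {
            long reclaimed = garbage * reclaimPercent / 100;
            used -= reclaimed;
            garbage -= reclaimed;
        }
    }

    /** Try, run one concurrent GLOBAL cycle on failure, retry once, then fail fast. */
    static void allocateOrThrow(ToyHeap heap, long size, int reclaimPercent) {
        if (heap.tryAllocate(size)) {
            return;
        }
        heap.concurrentGlobalGc(reclaimPercent); // no stop-the-world Full GC in this model
        if (heap.tryAllocate(size)) {
            return;
        }
        throw new OutOfMemoryError(
            "toy model: request of " + size + " bytes failed after concurrent GLOBAL GC");
    }
}
```

In this model, a request that fits after the single cycle succeeds, and one that does not fit fails fast instead of escalating to a stop-the-world collection.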
Note that there is no perfect solution. We may experience "spurious" OOM exceptions:
1. For example, even in the current more conservative implementation, there may be a race between multiple allocating threads, each attempting to allocate a very large object. The first thread to retry its allocation might succeed and the second thread to retry its allocation might fail (because the first thread consumed the newly available memory). So the second thread experiences OOMError even though another GC would have reclaimed the memory it wanted to allocate.
2. A GLOBAL GC won't necessarily reclaim all garbage. Following a concurrent Generational GLOBAL GC, we may need to perform multiple concurrent mixed evacuations in order to reclaim all of the dead memory identified by the GLOBAL GC mark. However, the first evacuation performed by the GLOBAL GC will normally reclaim a significant amount of garbage (as guided by the garbage-first heuristic). If this is not enough memory to satisfy the pending allocation request, we are in "dire straits", and a fail-fast OOMError is probably a better remediation than repeated attempts to allocate following repeated GC cycles.
3. Any concurrent remediation of failed allocation requests runs the risk that we will continue to fail to allocate because of the floating garbage that accumulates while we are performing concurrent GC. Though a stop-the-world full GC fully reclaims the floating garbage, it really only postpones the inevitable. While the world is stopped to perform full GC, requests for allocation continue to accumulate (in hardware request buffers, or in queued service requests that have not yet been satisfied). We will eventually experience a cascade of allocation failures until the service is able to do something to reduce its need for memory.
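The retry race described in item 1 can be reproduced in miniature. The following is a sketch only: `RetryRaceSketch` and `retryConcurrently` are hypothetical names, and an `AtomicLong` counter stands in for the memory made available by the preceding GC. Several threads retry the same large request at once; when the freed memory covers only one request, exactly one retry succeeds and the others would see OOM.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Toy reproduction of the retry race. Names are illustrative, not HotSpot code. */
final class RetryRaceSketch {

    /**
     * Starts `threads` workers that each retry one allocation of `requestSize`
     * bytes against `freeAfterGc` bytes of free memory.
     * Returns how many retries succeeded.
     */
    static int retryConcurrently(long freeAfterGc, long requestSize, int threads) {
        AtomicLong free = new AtomicLong(freeAfterGc);
        AtomicInteger successes = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all retries at about the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long current;
                do {
                    current = free.get();
                    if (current < requestSize) {
                        return; // this thread is the one that would throw OOM
                    }
                } while (!free.compareAndSet(current, current - requestSize));
                successes.incrementAndGet();
            });
            workers[i].start();
        }

        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return successes.get();
    }
}
```

Which thread wins is not deterministic, but the count of winners is: with 100 bytes free and two 80-byte requests, one retry succeeds and the other fails, even though a further GC cycle might have satisfied it.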