Type: Bug
Resolution: Unresolved
Priority: P4
At the end of final mark, we identify as immediate garbage those regions that had usage but no live data. In principle, the entirety of each immediate garbage region can be "immediately reclaimed" without waiting for evacuation or updating references.
In truth, this memory cannot be reclaimed immediately at the end of final mark. First, we do concurrent thread roots, then concurrent weak references, then concurrent weak roots, and finally concurrent cleanup. Only at the end of concurrent cleanup is this immediate garbage actually reclaimed. Then, we finish concurrent evacuation and proceed to update references.
In one AWS service load test, we have observed the following scenario:
1. 8608M of immediate garbage is identified at the end of final mark.
2. Concurrent roots takes 77.6 ms.
3. Concurrent weak references takes 0.703 ms.
4. Concurrent weak roots takes 4.254 ms.
5. Concurrent cleanup takes 0.108 ms.
6. During these 82.7 ms of concurrent GC activity, the mutators consume all remaining available memory, and a 2563K TLAB allocation failure causes us to cancel the concurrent GC and degenerate.
This behavior is observed to repeat 2-3 times during a certain phase of the workload, with each degenerated GC cycle introducing approximately 70 ms of stop-the-world pause.
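The arithmetic behind the 82.7 ms figure can be checked with a short sketch. The durations are copied from the observations above; the type and function names (`PhaseTime`, `observed_phases`, `total_ms`, `share_of`) are made up for this illustration and do not come from the Shenandoah sources. The phases sum to 82.665 ms, and concurrent roots alone accounts for roughly 94% of the window during which the immediate garbage is identified but not yet usable.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// One concurrent phase and its observed duration in milliseconds.
struct PhaseTime {
    const char* name;
    double ms;
};

// Durations reported in the load test, in the order the phases run
// between the end of final mark and the end of concurrent cleanup.
inline std::vector<PhaseTime> observed_phases() {
    return {
        {"concurrent roots", 77.6},
        {"concurrent weak references", 0.703},
        {"concurrent weak roots", 4.254},
        {"concurrent cleanup", 0.108},
    };
}

// Total delay before immediate garbage becomes available to mutators.
inline double total_ms(const std::vector<PhaseTime>& phases) {
    return std::accumulate(
        phases.begin(), phases.end(), 0.0,
        [](double acc, const PhaseTime& p) { return acc + p.ms; });
}

// Fraction of the total delay spent in the phase at the given index.
inline double share_of(const std::vector<PhaseTime>& phases, std::size_t index) {
    return phases.at(index).ms / total_ms(phases);
}
```

Calling `total_ms(observed_phases())` gives the length of the window, and `share_of(observed_phases(), 0)` gives the portion attributable to concurrent roots.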
In this scenario, it would be much better for handle_allocation_failure() to behave in the following way:
1. If we have identified immediate garbage that is about to be recycled, stall the thread that requests to allocate, but do not cancel GC.
2. At the end of "concurrent cleanup", if there are threads waiting due to an allocation failure, invoke ShenandoahControll:notify_alloc_failure_waiters().
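The two proposed steps could be sketched roughly as follows, using only the C++ standard library. This is an illustration of the stall-and-notify idea, not the HotSpot implementation: the class `AllocFailureGate`, its members, and the `Outcome` values are all hypothetical, and the real code would go through Shenandoah's own control thread and monitors. The sketch also assumes that concurrent cleanup always runs to completion once immediate garbage has been recorded; a real implementation would additionally have to wake waiters if the cycle is cancelled for some other reason.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for the collector state discussed above.
// None of these names are taken from the Shenandoah sources.
class AllocFailureGate {
public:
    enum class Outcome {
        RetryAfterCleanup,  // thread was stalled until cleanup finished; retry the allocation
        CancelGC            // no immediate garbage pending; fall back to cancel and degenerate
    };

    // Called at the end of final mark with the amount of immediate garbage found.
    void record_immediate_garbage(std::size_t bytes) {
        std::lock_guard<std::mutex> lock(mu_);
        pending_immediate_garbage_ = bytes;
    }

    // Step 1: called by a mutator thread whose allocation failed.
    // If immediate garbage is about to be recycled, block instead of cancelling GC.
    Outcome handle_allocation_failure() {
        std::unique_lock<std::mutex> lock(mu_);
        if (pending_immediate_garbage_ == 0) {
            return Outcome::CancelGC;
        }
        ++waiters_;
        const std::size_t epoch_at_entry = cleanup_epoch_;
        // The predicate guards against spurious wakeups.
        cv_.wait(lock, [&] { return cleanup_epoch_ != epoch_at_entry; });
        --waiters_;
        return Outcome::RetryAfterCleanup;
    }

    // Step 2: called at the end of concurrent cleanup, once the immediate
    // garbage has been recycled. Returns how many threads were waiting.
    std::size_t finish_concurrent_cleanup() {
        std::size_t waiting = 0;
        {
            std::lock_guard<std::mutex> lock(mu_);
            pending_immediate_garbage_ = 0;
            ++cleanup_epoch_;
            waiting = waiters_;
        }
        cv_.notify_all();
        return waiting;
    }

    // Number of threads currently stalled on an allocation failure.
    std::size_t waiters() {
        std::lock_guard<std::mutex> lock(mu_);
        return waiters_;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::size_t pending_immediate_garbage_ = 0;
    std::size_t cleanup_epoch_ = 0;
    std::size_t waiters_ = 0;
};
```

In this sketch a mutator that receives `RetryAfterCleanup` would retry its allocation against the freshly recycled regions, while `CancelGC` corresponds to today's behavior. The epoch counter is used rather than a boolean flag so that the same gate can be reused across successive GC cycles.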
Links to: Review (master) openjdk/shenandoah/479