Enhancement
Resolution: Unresolved
P2
None
If we see ZGC allocation stalls, that means that the allocation rate is higher than what ZGC can keep up with and that the application is running out of available heap space; ZGC has stalled the allocating application threads until memory can be reclaimed.
There is currently no dedicated allocation stall event, but there is a counter event containing the information. A separate allocation stall event may be introduced later. Either way, we can look at the counter event for now. Note that the required event is off by default. This is fine, as our standard rules mechanism will show this explicitly: the rule result will be N/A, and the message will list the dependent event type(s) that must be enabled for the rule to work.
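As a rough illustration of reading the counter event, the sketch below sums allocation stall increments from a recording file using the JDK's `jdk.jfr.consumer` API. The event type name `jdk.ZStatisticsCounter`, the counter id `Allocation Stall`, and the field names `id` and `increment` are assumptions that should be checked against the JDK version that produced the recording; the class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Path;

import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

class AllocationStallCounter {
    // Assumed names; verify them against the JDK that wrote the recording.
    static final String COUNTER_EVENT = "jdk.ZStatisticsCounter";
    static final String STALL_COUNTER_ID = "Allocation Stall";

    /** Returns true if an event type name and counter id describe an allocation stall sample. */
    static boolean isStallSample(String eventTypeName, String counterId) {
        return COUNTER_EVENT.equals(eventTypeName) && STALL_COUNTER_ID.equals(counterId);
    }

    /** Sums the "increment" field of all allocation stall counter samples in a recording. */
    static long countStalls(Path recording) throws IOException {
        long stalls = 0;
        try (RecordingFile file = new RecordingFile(recording)) {
            while (file.hasMoreEvents()) {
                RecordedEvent event = file.readEvent();
                String type = event.getEventType().getName();
                // Skip events that lack the fields this sketch assumes.
                if (!event.hasField("id") || !event.hasField("increment")) {
                    continue;
                }
                if (isStallSample(type, event.getString("id"))) {
                    stalls += event.getLong("increment");
                }
            }
        }
        return stalls;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AllocationStallCounter.java <recording.jfr>");
            return;
        }
        System.out.println("Allocation stalls: " + countStalls(Path.of(args[0])));
    }
}
```

Because the event is off by default, a result of zero from a recording made with default settings says nothing about whether stalls occurred; the recording must have been made with the counter event enabled.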
There are a few things that can be tried if stalls are found:
a) Increase the number of concurrent GC threads (-XX:ConcGCThreads). This will help ZGC win the race against the allocating threads. In your first GC log, there are 8 concurrent GC threads. It probably needs 10 or 12 concurrent GC threads in the absence of other changes.
b) Increase the size of the Java heap (-Xmx) to offer ZGC additional headroom.
c) Make changes to the application to either reduce the amount of live data, or reduce the allocation rate.
We can likely make the rule more interesting by including data from allocation events, heap statistics and (possibly) old object sample events. That said, they should not be required for the rule to trigger. If data is available, however, we can probably add some flavour to the suggestions in 'b' and 'c'.
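The intended behaviour can be sketched as a small evaluation function: N/A when the counter event is disabled, a warning with suggestions a) to c) when stalls are present, and an optional extra sentence when allocation data happens to be available. This is a minimal sketch, not the actual rules API; `AllocationStallRule`, `Severity`, `Result`, and the message texts are hypothetical names chosen for illustration.

```java
import java.util.Locale;
import java.util.OptionalDouble;

class AllocationStallRule {
    enum Severity { NOT_APPLICABLE, OK, WARNING }

    /** Hypothetical result holder; a real rule would use the framework's own result type. */
    static final class Result {
        final Severity severity;
        final String message;

        Result(Severity severity, String message) {
            this.severity = severity;
            this.message = message;
        }
    }

    /**
     * Evaluates the rule. The allocation rate is optional: the rule triggers
     * without it and only uses it to enrich the message.
     */
    static Result evaluate(boolean counterEventEnabled, long stallCount,
                           OptionalDouble allocationRateMiBPerSec) {
        if (!counterEventEnabled) {
            // Assumed event type name; see the counter event discussed above.
            return new Result(Severity.NOT_APPLICABLE,
                    "Event type jdk.ZStatisticsCounter must be enabled for this rule to work.");
        }
        if (stallCount == 0) {
            return new Result(Severity.OK, "No ZGC allocation stalls were found.");
        }
        StringBuilder message = new StringBuilder();
        message.append("ZGC stalled allocating threads ").append(stallCount).append(" time(s). ");
        message.append("Consider increasing the number of concurrent GC threads (-XX:ConcGCThreads), ");
        message.append("increasing the heap size (-Xmx), ");
        message.append("or reducing the amount of live data or the allocation rate.");
        if (allocationRateMiBPerSec.isPresent()) {
            message.append(String.format(Locale.ROOT,
                    " The observed allocation rate was %.1f MiB/s.",
                    allocationRateMiBPerSec.getAsDouble()));
        }
        return new Result(Severity.WARNING, message.toString());
    }
}
```

Keeping the optional inputs as `OptionalDouble` (or similar) makes it explicit that missing allocation or heap data never changes the severity, only the wording of the advice.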