  1. JDK
  2. JDK-8215935

G1 Old Gen MemoryPool CollectionUsage.used values don't reflect mixed GC results


    • Type: CSR
    • Resolution: Approved
    • Priority: P4
    • openjdk8u212
    • hotspot
    • None
    • gc
    • behavioral
    • minimal
      This CSR changes incorrect behavior to correct behavior. Users will generally see higher CollectionUsage numbers, which may trigger existing alarms that are set too low. That would be unusual, however, since most such alarms are set well above half the heap and mixed collections typically reduce the old gen size to well under half the heap. Further, without the fix, CollectionUsage values are zero until after a full stop-the-world GC, which never or very rarely happens in normal operation. Given that, it's unlikely anyone is actually using the value, since it never changes.
    • Other

      Summary

      The "G1 Old Gen" MemoryPoolMXBean is added to the memory pool list of the "G1 Young Collection" GarbageCollectorMXBean. Code that iterates over the list will see the addition. "G1 Old Gen" CollectionUsage.used is now updated after a mixed collection, whereas previously it was updated only after a full-heap stop-the-world GC. The "G1 Old Gen" CollectionUsage.used value series changes accordingly.
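      To see which pools a collector reports on, code can walk the GarbageCollectorMXBeans and print each one's memory pool names. The following is a minimal sketch using only the standard java.lang.management API; the class name CollectorPools and the method poolsByCollector are hypothetical names chosen for illustration. The output depends on the JVM and the collector in use. On a G1 JVM with this change, the summary above implies that "G1 Old Gen" would appear in the list for "G1 Young Collection".

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the JDK or of this CSR.
public class CollectorPools {

    /** Maps each collector's name to the names of the memory pools it manages. */
    public static Map<String, List<String>> poolsByCollector() {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getMemoryPoolNames() is inherited from MemoryManagerMXBean.
            result.put(gc.getName(), Arrays.asList(gc.getMemoryPoolNames()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one line per collector; the names vary with the collector in use.
        poolsByCollector().forEach((gc, pools) -> System.out.println(gc + " -> " + pools));
    }
}
```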

      Problem

      At Amazon, we want to measure long term heap occupancy with a view to detecting long term memory leaks and unexpected steady load increases. For this, we find CollectionUsage to be a more useful metric than Usage. The latter measures instantaneous heap occupancy, which contains an unknown amount of yet-to-be-collected garbage, so we can't determine where to set an alarm on it. Too low and we get false positives due to garbage that will shortly be collected. Too high and we don't detect a failing JVM in time to do anything about it. CollectionUsage, on the other hand, is a good proxy for the amount of long term live data and can be usefully alarmed on. At present, however, it's useless for G1 because it doesn't reflect the result of mixed collections on the old gen.
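      The difference between the two metrics can be sketched as follows. This is an illustrative example, not the monitoring code referred to above: the class name OldGenOccupancy, its methods, and the threshold value are all assumptions. It reads both Usage and CollectionUsage for a named pool through the standard MemoryPoolMXBean API and compares the post-collection figure against a caller-supplied limit. getCollectionUsage() can return null when a pool does not support the metric, so the sketch returns -1 in that case.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, not part of the JDK or of this CSR.
public class OldGenOccupancy {

    /** Bytes used in the named pool after its most recent collection, or -1 if unavailable. */
    public static long collectionUsedBytes(String poolName) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().equals(poolName)) {
                MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
                return afterGc == null ? -1L : afterGc.getUsed();
            }
        }
        return -1L; // no pool with that name on this JVM
    }

    /** True only when a valid (non-negative) reading is above the threshold. */
    public static boolean exceedsThreshold(long usedBytes, long thresholdBytes) {
        return usedBytes >= 0 && usedBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        String poolName = "G1 Old Gen";               // pool name as used in this CSR
        long thresholdBytes = 512L * 1024 * 1024;     // assumed alarm level, for illustration only
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().equals(poolName)) {
                // Usage is instantaneous occupancy and includes uncollected garbage.
                System.out.println("Usage.used           = " + pool.getUsage().getUsed());
            }
        }
        long afterGc = collectionUsedBytes(poolName);
        System.out.println("CollectionUsage.used = " + afterGc);
        System.out.println("above threshold      = " + exceedsThreshold(afterGc, thresholdBytes));
    }
}
```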

      Solution

      The root cause is that the "G1 Young Collection" GarbageCollectorMXBean records the result of a collection only on the "G1 Eden Space" and "G1 Survivor Space" memory pools, even in the case of a mixed collection. It is in fact misnamed: "G1 Incremental Collection" would be a better description of what it actually does.

      Adding the "G1 Old Gen" memory pool to the "G1 Young Collection" memory pool list doesn't violate the existing spec, because the spec doesn't define the pools for a collector. This CSR argues that recording the effect of mixed collections on the "G1 Old Gen" memory pool is correct behavior, that correct behavior is more important than legacy behavior, and that the current "G1 Old Gen" CollectionUsage value is of such limited use that changing it to something usable is low risk.

      G1 full GCs happen rarely and only under severe pressure, so when they do, external reaction is pretty much limited to reducing load so the JVM can get back to a usable steady state, or just restarting the JVM. Neither action depends on the value of CollectionUsage. G1 old gen CollectionUsage is zero until a full GC occurs, after which its value includes both long-lived objects and any transient data that was in eden and the survivor space. That value doesn't tell you anything about long term old gen or survivor size because it lumps them together. So, it isn't a useful metric, nor will it be after any subsequent full GCs. The only information it provides is on the first zero to non-zero transition, which just tells you that the JVM is, or was, in trouble. Further, the effect of the runup to a full GC is SLA violations, which are noticed before the full GC happens, so detecting the first full GC is unneeded confirmation, not prediction.

      Specification

      There are no specification changes. Behavioral changes are discussed above.

            phh Paul Hohensee
            phh Paul Hohensee
            Thomas Schatzl