Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8257774

G1: Trigger collect when free region count drops below threshold to prevent evacuation failures

XMLWordPrintable

    • gc
    • b26

      In most workloads G1 will not encounter evacuation failures. In the past, when I have encountered evacuation failures due to To-space exhaustion they were not that costly since most objects were evacuated before the exhaustion happened. Recently, I have been encountering severe cases of To-space exhaustion where there are 0 free regions when the GC is initiated. These GCs are extremely long and result in 0 objects being evacuated.

      The situation is happening because of bursts of short lived humongous object allocations. These bursts quickly consume all of the G1ReservePercent regions and then the rest of the free regions. This results in a collect which is regularly 5-10x longer than the normal collections. The "Evacuate Collection Set" times (20-40ms) are similar but "Post Evacuate Collection Set" jumps to over 300ms due to "Remove Self Forwards". Luckily, most of the humongous regions are reclaimed during the collect so a Full collect is not immediately initiated. We are encountering this in JDK 11 but we are able to reproduce with JDK 15 and tip. This issue could be worked around by tuning the heap, G1ReservedPercent and G1HeapRegionSize but this forces the JVM to be significantly over provisioned most of the time.

      The G1 collector keeps sufficient data about survival rates for Eden and survivor along with live bytes for Tenure, etc. so I believe we can resolve this issue. I am proposing that these stats are used to decide if there are enough Free regions available to sufficiently handle all evacuations if a GC was to happen at any point during the slow allocation path. At any point there is not sufficient free regions force a GC to happen.

      Most of the calculations can be constant between 2 GC events so the cost to do this check on the slow allocation path should be minimal.

      I have a working prototype available here:
      https://github.com/charliegracie/jdk/tree/humongous_regions

      Here is an example of a GC that exhibits this failure case:
      [12338.350s][info ][gc,start ] GC(910) Pause Young (Prepare Mixed) (G1 Evacuation Pause)
      [12338.350s][info ][gc,task ] GC(910) Using 8 workers of 8 for evacuation
      [12338.351s][debug][gc,age ] GC(910) Desired survivor size 352321536 bytes, new threshold 15 (max threshold 15)
      [12338.401s][debug][gc,ref ] GC(910) Skipped phase1 of Reference Processing due to unavailable references
      [12338.415s][debug][gc,ref ] GC(910) Skipped phase3 of Reference Processing due to unavailable references
      [12338.781s][trace][gc,age ] GC(910) Age table with threshold 15 (max threshold 15)
      [12338.784s][info ][gc,mmu ] GC(910) MMU target violated: 101.0ms (100.0ms/101.0ms)
      [12338.784s][info ][gc ] GC(910) To-space exhausted
      [12338.784s][info ][gc,phases ] GC(910) Pre Evacuate Collection Set: 0.5ms
      [12338.785s][debug][gc,phases ] GC(910) Prepare TLABs: 0.5ms
      [12338.785s][debug][gc,phases ] GC(910) Choose Collection Set: 0.0ms
      [12338.785s][debug][gc,phases ] GC(910) Humongous Register: 0.5ms
      [12338.785s][trace][gc,phases ] GC(910) Humongous Total: 907
      [12338.785s][trace][gc,phases ] GC(910) Humongous Candidate: 907
      [12338.785s][info ][gc,phases ] GC(910) Evacuate Collection Set: 47.7ms
      [12338.785s][trace][gc,phases ] GC(910) GC Worker Start (ms): Min: 12338352.6, Avg: 12338352.6, Max: 12338352.6, Diff: 0.0, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Ext Root Scanning (ms): Min: 1.2, Avg: 2.1, Max: 2.9, Diff: 1.6, Sum: 17.1, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Thread Roots (ms): Min: 4.3, Avg: 7.6, Max: 13.7, Diff: 9.5, Sum: 61.1, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) StringTable Roots (ms): Min: 0.0, Avg: 0.5, Max: 1.6, Diff: 1.6, Sum: 3.7, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Universe Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) JNI Handles Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) ObjectSynchronizer Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Management Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) SystemDictionary Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) CLDG Roots (ms): Min: 0.0, Avg: 0.2, Max: 1.5, Diff: 1.5, Sum: 1.5, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) JVMTI Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) CM RefProcessor Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Wait For Strong CLD (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Weak CLD Roots (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) SATB Filtering (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Update RS (ms): Min: 7.8, Avg: 9.0, Max: 10.3, Diff: 2.5, Sum: 72.0, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Processed Buffers: Min: 68, Avg: 127.2, Max: 164, Diff: 96, Sum: 1018, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Scanned Cards: Min: 4633, Avg: 5498.5, Max: 6277, Diff: 1644, Sum: 43988, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Skipped Cards: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 2, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Scan HCC (ms): Min: 0.0, Avg: 0.4, Max: 0.7, Diff: 0.7, Sum: 2.8, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Scan RS (ms): Min: 3.9, Avg: 7.4, Max: 9.5, Diff: 5.6, Sum: 59.0, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Scanned Cards: Min: 3573, Avg: 6400.0, Max: 8146, Diff: 4573, Sum: 51200, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Claimed Cards: Min: 10586, Avg: 21406.6, Max: 32218, Diff: 21632, Sum: 171253, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Skipped Cards: Min: 24465, Avg: 50612.8, Max: 59202, Diff: 34737, Sum: 404902, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) AOT Root Scanning (ms): skipped
      [12338.785s][debug][gc,phases ] GC(910) Object Copy (ms): Min: 27.0, Avg: 28.7, Max: 33.4, Diff: 6.4, Sum: 229.3, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) Termination Attempts: Min: 1, Avg: 18.5, Max: 25, Diff: 24, Sum: 148, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) GC Worker Other (ms): Min: 0.1, Avg: 0.1, Max: 0.1, Diff: 0.0, Sum: 0.6, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) GC Worker Total (ms): Min: 47.6, Avg: 47.6, Max: 47.6, Diff: 0.1, Sum: 381.0, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) GC Worker End (ms): Min: 12338400.2, Avg: 12338400.2, Max: 12338400.2, Diff: 0.0, Workers: 8
      [12338.785s][info ][gc,phases ] GC(910) Post Evacuate Collection Set: 383.4ms
      [12338.785s][debug][gc,phases ] GC(910) Code Roots Fixup: 0.1ms
      [12338.785s][debug][gc,phases ] GC(910) Clear Card Table: 0.4ms
      [12338.785s][debug][gc,phases ] GC(910) Reference Processing: 15.0ms
      [12338.785s][debug][gc,phases ] GC(910) Weak Processing: 0.3ms
      [12338.785s][debug][gc,phases ] GC(910) Evacuation Failure: 363.9ms
      [12338.785s][trace][gc,phases ] GC(910) Recalculate Used: 0.1ms
      [12338.785s][trace][gc,phases ] GC(910) Remove Self Forwards: 363.8ms
      [12338.785s][debug][gc,phases ] GC(910) Merge Per-Thread State: 0.0ms
      [12338.785s][debug][gc,phases ] GC(910) Code Roots Purge: 0.0ms
      [12338.785s][debug][gc,phases ] GC(910) Redirty Cards: 0.8ms
      [12338.785s][trace][gc,phases ] GC(910) Parallel Redirty (ms): Min: 0.7, Avg: 0.7, Max: 0.7, Diff: 0.0, Sum: 5.7, Workers: 8
      [12338.785s][trace][gc,phases ] GC(910) Redirtied Cards: Min: 61725, Avg: 63779.0, Max: 67785, Diff: 6060, Sum: 510232, Workers: 8
      [12338.785s][debug][gc,phases ] GC(910) DerivedPointerTable Update: 0.4ms
      [12338.785s][debug][gc,phases ] GC(910) Free Collection Set: 0.2ms
      [12338.785s][trace][gc,phases ] GC(910) Free Collection Set Serial: 0.1ms
      [12338.786s][trace][gc,phases ] GC(910) Young Free Collection Set (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.0, Sum: 0.3, Workers: 7
      [12338.786s][trace][gc,phases ] GC(910) Non-Young Free Collection Set (ms): skipped
      [12338.786s][debug][gc,phases ] GC(910) Humongous Reclaim: 2.9ms
      [12338.786s][trace][gc,phases ] GC(910) Humongous Reclaimed: 905
      [12338.786s][debug][gc,phases ] GC(910) Start New Collection Set: 0.0ms
      [12338.786s][debug][gc,phases ] GC(910) Resize TLABs: 0.2ms
      [12338.786s][debug][gc,phases ] GC(910) Expand Heap After Collection: 0.0ms
      [12338.786s][info ][gc,phases ] GC(910) Other: 2.8ms
      [12338.786s][info ][gc,heap ] GC(910) Eden regions: 657->0(89)
      [12338.786s][info ][gc,heap ] GC(910) Survivor regions: 7->0(84)
      [12338.786s][info ][gc,heap ] GC(910) Old regions: 221->884
      [12338.786s][info ][gc,heap ] GC(910) Humongous regions: 907->2
      [12338.786s][info ][gc,metaspace ] GC(910) Metaspace: 179166K->179166K(1212416K)
      [12338.786s][info ][gc ] GC(910) Pause Young (Prepare Mixed) (G1 Evacuation Pause) 14331M->7083M(14336M) 435.483ms
      [12338.786s][info ][gc,cpu ] GC(910) User=3.36s Sys=0.02s Real=0.44s

            adityam Aditya Mandaleeka
            cgracie Charlie Gracie
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: