Type: Bug
Resolution: Unresolved
Priority: P5
We have observed, in some artificially contrived "extreme" workloads, that GenShen (and even traditional Shenandoah) can be too eager to throw OutOfMemoryError (OOM).
This has been observed with an Extremem workload that runs (for testing purposes) with -XX:+ShenandoahVerify and -XX:+ShenandoahAllocFailureALot. The effect of +ShenandoahVerify is to make GC run very slowly, spending hundreds of seconds in each GC cycle on verification. The effect of +ShenandoahAllocFailureALot is to force frequent degenerated cycles, causing hundreds of mutator threads in each cycle to wait for completion of a subsequent Full GC. When that Full GC eventually completes, all of these waiting mutator threads wake up and reissue their pending allocation requests, concurrently with other running threads that are also allocating. This "storm" of concurrent allocation requests often causes some of the retried allocations to fail, because new mutator requests have already consumed the freshly replenished memory, and OOM is thrown.
We know, however, that there is sufficient memory within the heap to run this workload (albeit very slowly due to the ShenandoahVerify interference).
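To make the race concrete, here is a grossly simplified C++ sketch of the failure mode described above. It is not the actual HotSpot allocation path; try_allocate, block_until_full_gc_completes, and throw_out_of_memory_error are hypothetical stand-ins, declared only so the sketch is self-contained.
```
#include <cstddef>

// Hypothetical hooks standing in for the real allocation machinery.
void* try_allocate(size_t bytes);        // returns nullptr on failure
void  block_until_full_gc_completes();   // park this mutator until Full GC finishes
void  throw_out_of_memory_error();       // raise OutOfMemoryError

// Simplified view of the failure mode: hundreds of blocked mutators run the
// retry concurrently when the Full GC finishes, racing one another and the
// never-blocked mutators for the freshly replenished memory. Losers of that
// race throw OOM even though the heap, overall, has enough room.
void* allocate_or_oom(size_t bytes) {
  if (void* p = try_allocate(bytes)) return p;  // normal fast path
  block_until_full_gc_completes();              // degenerated cycle escalated to Full GC
  if (void* p = try_allocate(bytes)) return p;  // retry may lose the race
  throw_out_of_memory_error();
  return nullptr;
}
```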
A possible improvement would be to tally the allocation requests known to be waiting for a Full GC to complete. If the memory available in the young generation at the end of the Full GC is at least as large as the total of the pending allocation requests, we would refrain from throwing OOM and instead wait for yet another Full GC before retrying the allocation. This extra check is needed because we cannot guarantee that all of the memory reclaimed by the Full GC will be granted to the waiting allocators; some of it may be consumed by concurrently running mutators that were not blocked waiting to retry an allocation.
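A minimal sketch of the proposed bookkeeping follows. The names (PendingAllocationLedger, on_retry_failure) are hypothetical illustrations, not actual Shenandoah code.
```
#include <atomic>
#include <cstddef>

// Hypothetical ledger tracking the total bytes of allocation requests
// currently blocked waiting for a Full GC, so the allocation-failure path
// can judge whether throwing OOM would be premature.
class PendingAllocationLedger {
  std::atomic<size_t> _pending_bytes{0};
public:
  // Called when a mutator blocks on allocation failure.
  void register_waiter(size_t request_bytes) {
    _pending_bytes.fetch_add(request_bytes, std::memory_order_relaxed);
  }
  // Called when a blocked mutator's retry succeeds (or it gives up).
  void unregister_waiter(size_t request_bytes) {
    _pending_bytes.fetch_sub(request_bytes, std::memory_order_relaxed);
  }
  size_t pending_bytes() const {
    return _pending_bytes.load(std::memory_order_relaxed);
  }
};

enum class FailurePolicy { ThrowOOM, WaitForNextFullGC };

// Decision point after a retried allocation fails again: if the Full GC just
// freed enough young memory to cover every waiting request, the failure is
// likely due to racing mutators consuming the replenished memory, so wait
// for another Full GC instead of throwing OOM.
FailurePolicy on_retry_failure(const PendingAllocationLedger& ledger,
                               size_t young_available_after_full_gc) {
  if (young_available_after_full_gc >= ledger.pending_bytes()) {
    return FailurePolicy::WaitForNextFullGC;
  }
  return FailurePolicy::ThrowOOM;
}
```
Under this scheme, the retry path sketched earlier would consult on_retry_failure() after a failed retry and loop back to waiting rather than throwing, for as long as post-GC young free space covers the outstanding demand.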
Reproducer:
```
~/github/shenandoah.backport-freeset/build/linux-x86_64-server-fastdebug/jdk/bin/java \
-XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms10g -Xmx10g \
-XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions \
-XX:-ShenandoahPacing -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational \
-XX:+ShenandoahVerify -XX:+ShenandoahAllocFailureALot \
-XX:-UseStringDeduplication -XX:ShenandoahGuaranteedOldGCInterval=120000 \
-javaagent:$HOME/lib/jHiccup-2.0.10/jHiccup.jar=-l,results/genshen/extremem/extremem.jhiccup.log,-i,1000,-a \
-Xlog:async -Xlog:gc*=info -Xlog:safepoint=info \
-jar ~/github/heapothesys/Extremem/target/extremem-1.0-SNAPSHOT.jar \
-dInitializationDelay=5m -dDictionarySize=5000000 -dNumCustomers=9000000 -dNumProducts=22000 \
-dCustomerThreads=700 -dCustomerPeriod=2s -dCustomerThinkTime=1s -dKeywordSearchCount=4 \
-dSelectionCriteriaCount=8 -dBuyThreshold=0.4 -dSaveForLaterThreshold=0.4 -dServerThreads=5 \
-dServerPeriod=5s -dProductNameLength=10 -dBrowsingHistoryQueueCount=5 -dSalesTransactionQueueCount=5 \
-dProductDescriptionLength=64 -dProductReviewLength=32 -dProductReplacementPeriod=25s \
-dProductReplacementCount=50 -dCustomerReplacementPeriod=30s -dCustomerReplacementCount=1000 \
-dBrowsingExpiration=1m -dPhasedUpdates=true -dPhasedUpdateInterval=30s -dSimulationDuration=20m \
-dReportCSV=true -dResponseTimeMeasurements=100000 > trial.$COUNT.out 2>trial.$COUNT.err
```
Though this problem occurs most readily with -XX:+UnlockDiagnosticVMOptions -XX:+ShenandoahVerify -XX:+ShenandoahAllocFailureALot, I believe it has also been observed, much more rarely, without these flags.
A GC log with "extra diagnostic" information that helps reveal the under-the-hood behavior of the existing implementation is attached.