Type: Enhancement
Resolution: Unresolved
Priority: P4
Affects Version/s: 9, 10, 11, 12, 13
While migrating our production services from CMS to G1, we found that G1’s more complex write post-barrier incurs considerable CPU cost. Currently, the post-barrier for a write “p.f = q” is:
    if ((p xor q) >> LOG_REGION_BITS != 0) {  // the write crosses a region boundary
      if (q != null) {
        card_address = &card_table[addr_to_index(p)];
        if (*card_address != YOUNG) {
          store_load_fence;
          if (*card_address != DIRTY) {
            *card_address = DIRTY;
            T.dirtyCardQueue.enqueue(card_address);
          }
        }
      }
    }
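To make the control flow easier to follow, here is a minimal Python toy model of the barrier above. It is a sketch under stated assumptions, not HotSpot code: the region size, card size, card values, the `G1BarrierModel` class, and the `fences` counter are all hypothetical and exist only for illustration. Addresses are plain integers and `0` stands for null.

```python
# Toy model of the G1 post-barrier pseudocode; all constants are assumptions.
LOG_REGION_BITS = 20                      # assume 1 MiB regions
CARD_SHIFT = 9                            # assume 512-byte cards
CLEAN, DIRTY, YOUNG = 0xFF, 0x00, 0x02    # hypothetical card values


class G1BarrierModel:
    def __init__(self, heap_bytes, young_cards=()):
        # One byte per card, all clean initially.
        self.card_table = bytearray([CLEAN]) * (heap_bytes >> CARD_SHIFT)
        for idx in young_cards:
            self.card_table[idx] = YOUNG
        self.dirty_card_queue = []        # stands in for T.dirtyCardQueue
        self.fences = 0                   # counts store_load_fence executions

    def post_write(self, p, q):
        """Model the post-barrier for `p.f = q` (q == 0 means null)."""
        if (p ^ q) >> LOG_REGION_BITS == 0:
            return                        # same region: nothing to record
        if q == 0:
            return                        # null store: nothing to record
        idx = p >> CARD_SHIFT             # addr_to_index(p)
        if self.card_table[idx] == YOUNG:
            return                        # young cards are never enqueued
        self.fences += 1                  # the fence runs before the dirty check
        if self.card_table[idx] != DIRTY:
            self.card_table[idx] = DIRTY
            self.dirty_card_queue.append(idx)
```

In this model, repeating the same cross-region write enqueues the card only once, but the fence is still executed on every repetition, which is one place the extra CPU cost comes from.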
And for CMS the write barrier is only:
    card_address = &card_table[addr_to_index(p)];
    *card_address = DIRTY;
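For comparison, the CMS barrier amounts to one index computation and one unconditional store. The sketch below models it in Python, assuming 512-byte cards (so `addr_to_index(p)` is `p >> 9`) and hypothetical card values; `cms_post_write` is an illustrative name, not a HotSpot function.

```python
CARD_SHIFT = 9               # assume 512-byte cards
CLEAN, DIRTY = 0xFF, 0x00    # hypothetical card values


def cms_post_write(card_table, p):
    """Unconditionally dirty the card covering address p."""
    card_table[p >> CARD_SHIFT] = DIRTY


# Example: a 16-card table; writing to address 1024 dirties card 2.
card_table = bytearray([CLEAN]) * 16
cms_post_write(card_table, 1024)
```

There is no filtering, no fence, and no queue: the store is idempotent, so repeated writes to the same card cost the same as the first one.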
The complexity of G1’s write post-barrier is due to the need to support concurrent refinement threads. However, even if the user has set -XX:G1ConcRefinementThreads=0, the write post-barrier remains the same. Ideally, the write post-barrier could be much simpler when there is no concurrent refinement.
This RFE proposes to add a mode to G1 that uses a simplified write post-barrier:
    if ((p xor q) >> LOG_REGION_BITS != 0) {
      if (q != null) {
        card_address = &card_table[addr_to_index(p)];
        *card_address = DIRTY;
      }
    }
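The proposed barrier keeps only the two cheap filters and the card store. A minimal Python model, under the same illustrative assumptions as elsewhere (1 MiB regions, 512-byte cards, hypothetical card values, `0` as null; `simplified_post_write` is a made-up name):

```python
LOG_REGION_BITS = 20         # assume 1 MiB regions
CARD_SHIFT = 9               # assume 512-byte cards
CLEAN, DIRTY = 0xFF, 0x00    # hypothetical card values


def simplified_post_write(card_table, p, q):
    """Model the simplified post-barrier for `p.f = q` (q == 0 means null)."""
    if (p ^ q) >> LOG_REGION_BITS != 0 and q != 0:
        # Cross-region, non-null store: dirty the card, nothing else.
        card_table[p >> CARD_SHIFT] = DIRTY
```

Compared with the current barrier, the young-card check, the StoreLoad fence, the dirty check, and the queue insertion are all gone from the mutator path.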
In this mode, G1 would disable concurrent refinement and the per-Java-thread dirty card queues. G1 would then need to process all dirty cards during a collection pause. Pause times could therefore become longer, but as long as MaxGCPauseMillis is reasonably large relative to the heap size, G1’s adaptive heuristics should still be able to adjust the young-gen size to meet the pause time goal.
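Without dirty card queues, the pause has to discover dirty cards by walking the card table itself, so this part of the pause grows with the size of the table (and hence the heap). The Python sketch below illustrates that trade-off; it is an assumption about how such a scan could look, not the prototype's implementation, and `collect_dirty_cards` and the card values are hypothetical.

```python
CLEAN, DIRTY = 0xFF, 0x00    # hypothetical card values


def collect_dirty_cards(card_table):
    """Return the indices of all dirty cards and reset them to clean.

    The cost is linear in the table size, regardless of how many
    cards are actually dirty.
    """
    dirty = [i for i, value in enumerate(card_table) if value == DIRTY]
    for i in dirty:
        card_table[i] = CLEAN
    return dirty
```

A real collector would presumably split the table among parallel GC worker threads; the linear dependence on table size is the point being illustrated here.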
This new mode would reduce G1’s CPU usage considerably. It would be particularly helpful for certain types of workloads, e.g.:
- Workloads that are heavily tuned for CMS to minimize old-gen collections and are sensitive to CPU usage;
- Workloads that mainly care about throughput and CPU usage.
I have implemented a prototype for this mode, and attached some preliminary results.
Relates to:
- JDK-8253230 G1 20% slower than Parallel in JRuby rubykon benchmark (Open)
- JDK-8226731 Remove StoreLoad in G1 post barrier (Open)
- JDK-8229049 JEP 363: Remove the Concurrent Mark Sweep (CMS) Garbage Collector (Closed)
- JDK-8230187 Throughput post-write barrier for G1 (Draft)