-
Enhancement
-
Resolution: Duplicate
-
P4
-
11, 17, 21, 22
Was working on some GC tests, and notices the major oddity in Parallel GC behavior. The workload like this:
```
import java.util.concurrent.ThreadLocalRandom;
public class Fragger {
static final int COUNT = 15_000_000;
static final Object[] ARR = new Object[COUNT];
public static void main(String... args) {
for (int c = 0; c < COUNT; c++) {
ARR[c] = new byte[16];
}
long end = System.nanoTime() + 30_000_000_000L;
while (System.nanoTime() < end) {
int idxFull = ThreadLocalRandom.current().nextInt(COUNT);
ARR[idxFull] = new byte[16];
}
}
}
```
With Parallel, Full GC times are out of whack:
```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseParallelGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[1.333s][info][gc] GC(2) Pause Full (Ergonomics) 530M->471M(981M) 652.579ms
[3.315s][info][gc] GC(3) Pause Full (Ergonomics) 727M->517M(981M) 1721.082ms
[5.904s][info][gc] GC(4) Pause Full (Ergonomics) 773M->517M(981M) 2283.419ms
[9.001s][info][gc] GC(5) Pause Full (Ergonomics) 773M->517M(981M) 2673.602ms
[12.439s][info][gc] GC(6) Pause Full (Ergonomics) 773M->517M(981M) 3013.188ms
[16.079s][info][gc] GC(7) Pause Full (Ergonomics) 773M->517M(981M) 3219.932ms
[19.654s][info][gc] GC(8) Pause Full (Ergonomics) 773M->517M(981M) 3247.373ms
[23.692s][info][gc] GC(9) Pause Full (Ergonomics) 773M->517M(981M) 3705.432ms
[27.382s][info][gc] GC(10) Pause Full (Ergonomics) 773M->517M(981M) 3346.816ms
[31.079s][info][gc] GC(11) Pause Full (Ergonomics) 773M->517M(981M) 3366.989ms
```
For comparison, Shenandoah's Full GC is 7x (!) faster:
```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseShenandoahGC -XX:+UnlockDiagnosticVMOptions -XX:ShenandoahGCMode=passive -XX:-ShenandoahDegeneratedGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[1.290s][info][gc] GC(0) Pause Full 1023M->517M(1024M) 389.849ms
[2.328s][info][gc] GC(1) Pause Full 1023M->517M(1024M) 422.842ms
[3.353s][info][gc] GC(2) Pause Full 1023M->517M(1024M) 432.260ms
[4.378s][info][gc] GC(3) Pause Full 1023M->517M(1024M) 433.945ms
[5.501s][info][gc] GC(4) Pause Full 1023M->517M(1024M) 439.280ms
[6.527s][info][gc] GC(5) Pause Full 1023M->517M(1024M) 433.924ms
[7.601s][info][gc] GC(6) Pause Full 1023M->517M(1024M) 430.845ms
[8.629s][info][gc] GC(7) Pause Full 1023M->517M(1024M) 426.524ms
[9.675s][info][gc] GC(8) Pause Full 1023M->517M(1024M) 439.208ms
```
Even Serial is faster!
```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseSerialGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[25.993s][info][gc] GC(35) Pause Full (Allocation Failure) 989M->552M(989M) 978.552ms
[27.330s][info][gc] GC(37) Pause Full (Allocation Failure) 825M->552M(989M) 972.567ms
[28.683s][info][gc] GC(39) Pause Full (Allocation Failure) 825M->552M(989M) 986.712ms
[30.050s][info][gc] GC(41) Pause Full (Allocation Failure) 825M->517M(989M) 1003.309ms
[31.409s][info][gc] GC(43) Pause Full (Allocation Failure) 790M->552M(989M) 992.085ms
```
I think part of it is scalability problem. Running with -XX:ParallelGCThreads=1 shortens the gap against Shenandoah to "only" 2x, and Parallel is 3x slower than Serial. Hacking PSParallelCompact::enqueue_dense_prefix_tasks to hand out a region per thread and adding region distribution logging shows that one region takes the exorbitant amount of time to process.
```
[6.518s][info][gc] GC(4) Worker 0 finished regions [0; 1) in 93 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [1; 2) in 3035901 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [2; 3) in 1 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [3; 4) in 0 us
... all other regions take 0 us too...
[9.731s][info][gc] GC(4) Pause Full (Ergonomics) 773M->517M(981M) 4508.276ms
```
The sample profile points to live_words_in_range as the bottleneck, see "asprof-parallelgc-1thread.html`. The profile gets worse with the default number of parallel GC threads, as all those workers are spinning waiting for a loner worker to complete.
I think that one loner thread is busy iterating the large reference array and figuring out the new locations for objects referenced from it. It does not matter if that array spans multiple regions or not: the walk would still be single-threaded. Plus, since the referenced objects are not guaranteed to be laid out linearly in memory, the cache (JDK-8145318) is not effective.
So we enter the slow path, and to compute the new address, we need compute the number of live words in the block that preceed the object (since those objects would preceed the given object after the sliding). Which means, if we randomly iterate over many blocks, we end up doing one bitmap search for every object, which costs O(N) of number of live objects in the block. If most objects in the block are live, then it amounts to O(N^2) total time to deal with entire thing, where N is the average number of objects per block. Even if the block is small (currently 128 words), N could be large for the ensemble of small objects: about 8 words long object gives 32 objects per block.
This is corroborated by simple performance experiment: twiddling `ParallelCompactData::Log2BlockSize`.
```
# ParallelCompactData::Log2BlockSize = 16 (maximum, fills entire region)
[8.118s][info][gc] GC(3) Pause Full (Ergonomics) 727M->518M(981M) 6134.907ms
[17.144s][info][gc] GC(4) Pause Full (Ergonomics) 774M->517M(981M) 8663.333ms
[28.121s][info][gc] GC(5) Pause Full (Ergonomics) 773M->517M(981M) 10606.085ms
[39.806s][info][gc] GC(6) Pause Full (Ergonomics) 773M->517M(981M) 11390.484ms
# ParallelCompactData::Log2BlockSize = 7 (default)
[22.000s][info][gc] GC(9) Pause Full (Ergonomics) 773M->517M(981M) 3052.236ms
[25.388s][info][gc] GC(10) Pause Full (Ergonomics) 773M->517M(981M) 3094.713ms
[29.115s][info][gc] GC(11) Pause Full (Ergonomics) 773M->517M(981M) 3388.381ms
[33.196s][info][gc] GC(12) Pause Full (Ergonomics) 773M->517M(981M) 3735.875ms
# ParallelCompactData::Log2BlockSize = 4
[26.898s][info][gc] GC(19) Pause Full (Ergonomics) 774M->518M(981M) 1166.031ms
[28.191s][info][gc] GC(20) Pause Full (Ergonomics) 774M->518M(981M) 991.396ms
[29.643s][info][gc] GC(21) Pause Full (Ergonomics) 774M->518M(981M) 1152.030ms
[31.055s][info][gc] GC(22) Pause Full (Ergonomics) 774M->518M(981M) 979.091ms
# ParallelCompactData::Log2BlockSize = 1 (minimum)
[28.284s][info][gc] GC(23) Pause Full (Ergonomics) 774M->518M(981M) 880.228ms
[29.501s][info][gc] GC(24) Pause Full (Ergonomics) 774M->518M(981M) 882.738ms
[30.662s][info][gc] GC(25) Pause Full (Ergonomics) 774M->517M(981M) 844.468ms
[31.871s][info][gc] GC(26) Pause Full (Ergonomics) 773M->517M(981M) 873.624ms
```
It probably a design/implementation question: can we do the pointer calculation like Serial, G1 and Shenandoah Full GCs do, go linearly over the whole objects in the block/region/heap and compute all addresses in one swing, instead of going to possibly random objects and trying to figure out new addresses on the spot? I think current scheme is the part of Scissor, but I need to brush up on that. It would probably require storing the fwdptr information somewhere.
If not, can we do smaller block size to limit the impact of live bitmap scans?
If not, can we chunk the array walks in compact to avoid the single-threaded execution like in these examples?
```
import java.util.concurrent.ThreadLocalRandom;
public class Fragger {
static final int COUNT = 15_000_000;
static final Object[] ARR = new Object[COUNT];
public static void main(String... args) {
for (int c = 0; c < COUNT; c++) {
ARR[c] = new byte[16];
}
long end = System.nanoTime() + 30_000_000_000L;
while (System.nanoTime() < end) {
int idxFull = ThreadLocalRandom.current().nextInt(COUNT);
ARR[idxFull] = new byte[16];
}
}
}
```
With Parallel, Full GC times are out of whack:
```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseParallelGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[1.333s][info][gc] GC(2) Pause Full (Ergonomics) 530M->471M(981M) 652.579ms
[3.315s][info][gc] GC(3) Pause Full (Ergonomics) 727M->517M(981M) 1721.082ms
[5.904s][info][gc] GC(4) Pause Full (Ergonomics) 773M->517M(981M) 2283.419ms
[9.001s][info][gc] GC(5) Pause Full (Ergonomics) 773M->517M(981M) 2673.602ms
[12.439s][info][gc] GC(6) Pause Full (Ergonomics) 773M->517M(981M) 3013.188ms
[16.079s][info][gc] GC(7) Pause Full (Ergonomics) 773M->517M(981M) 3219.932ms
[19.654s][info][gc] GC(8) Pause Full (Ergonomics) 773M->517M(981M) 3247.373ms
[23.692s][info][gc] GC(9) Pause Full (Ergonomics) 773M->517M(981M) 3705.432ms
[27.382s][info][gc] GC(10) Pause Full (Ergonomics) 773M->517M(981M) 3346.816ms
[31.079s][info][gc] GC(11) Pause Full (Ergonomics) 773M->517M(981M) 3366.989ms
```
For comparison, Shenandoah's Full GC is 7x (!) faster:
```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseShenandoahGC -XX:+UnlockDiagnosticVMOptions -XX:ShenandoahGCMode=passive -XX:-ShenandoahDegeneratedGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[1.290s][info][gc] GC(0) Pause Full 1023M->517M(1024M) 389.849ms
[2.328s][info][gc] GC(1) Pause Full 1023M->517M(1024M) 422.842ms
[3.353s][info][gc] GC(2) Pause Full 1023M->517M(1024M) 432.260ms
[4.378s][info][gc] GC(3) Pause Full 1023M->517M(1024M) 433.945ms
[5.501s][info][gc] GC(4) Pause Full 1023M->517M(1024M) 439.280ms
[6.527s][info][gc] GC(5) Pause Full 1023M->517M(1024M) 433.924ms
[7.601s][info][gc] GC(6) Pause Full 1023M->517M(1024M) 430.845ms
[8.629s][info][gc] GC(7) Pause Full 1023M->517M(1024M) 426.524ms
[9.675s][info][gc] GC(8) Pause Full 1023M->517M(1024M) 439.208ms
```
Even Serial is faster!
```
$ build/macosx-aarch64-server-release/images/jdk/bin/java -Xmx1g -Xms1g -XX:+UseSerialGC -Xlog:gc Fragger.java 2>&1 | grep "Pause Full"
[25.993s][info][gc] GC(35) Pause Full (Allocation Failure) 989M->552M(989M) 978.552ms
[27.330s][info][gc] GC(37) Pause Full (Allocation Failure) 825M->552M(989M) 972.567ms
[28.683s][info][gc] GC(39) Pause Full (Allocation Failure) 825M->552M(989M) 986.712ms
[30.050s][info][gc] GC(41) Pause Full (Allocation Failure) 825M->517M(989M) 1003.309ms
[31.409s][info][gc] GC(43) Pause Full (Allocation Failure) 790M->552M(989M) 992.085ms
```
I think part of it is scalability problem. Running with -XX:ParallelGCThreads=1 shortens the gap against Shenandoah to "only" 2x, and Parallel is 3x slower than Serial. Hacking PSParallelCompact::enqueue_dense_prefix_tasks to hand out a region per thread and adding region distribution logging shows that one region takes the exorbitant amount of time to process.
```
[6.518s][info][gc] GC(4) Worker 0 finished regions [0; 1) in 93 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [1; 2) in 3035901 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [2; 3) in 1 us
[9.554s][info][gc] GC(4) Worker 0 finished regions [3; 4) in 0 us
... all other regions take 0 us too...
[9.731s][info][gc] GC(4) Pause Full (Ergonomics) 773M->517M(981M) 4508.276ms
```
The sample profile points to live_words_in_range as the bottleneck, see "asprof-parallelgc-1thread.html`. The profile gets worse with the default number of parallel GC threads, as all those workers are spinning waiting for a loner worker to complete.
I think that one loner thread is busy iterating the large reference array and figuring out the new locations for objects referenced from it. It does not matter if that array spans multiple regions or not: the walk would still be single-threaded. Plus, since the referenced objects are not guaranteed to be laid out linearly in memory, the cache (JDK-8145318) is not effective.
So we enter the slow path, and to compute the new address, we need compute the number of live words in the block that preceed the object (since those objects would preceed the given object after the sliding). Which means, if we randomly iterate over many blocks, we end up doing one bitmap search for every object, which costs O(N) of number of live objects in the block. If most objects in the block are live, then it amounts to O(N^2) total time to deal with entire thing, where N is the average number of objects per block. Even if the block is small (currently 128 words), N could be large for the ensemble of small objects: about 8 words long object gives 32 objects per block.
This is corroborated by simple performance experiment: twiddling `ParallelCompactData::Log2BlockSize`.
```
# ParallelCompactData::Log2BlockSize = 16 (maximum, fills entire region)
[8.118s][info][gc] GC(3) Pause Full (Ergonomics) 727M->518M(981M) 6134.907ms
[17.144s][info][gc] GC(4) Pause Full (Ergonomics) 774M->517M(981M) 8663.333ms
[28.121s][info][gc] GC(5) Pause Full (Ergonomics) 773M->517M(981M) 10606.085ms
[39.806s][info][gc] GC(6) Pause Full (Ergonomics) 773M->517M(981M) 11390.484ms
# ParallelCompactData::Log2BlockSize = 7 (default)
[22.000s][info][gc] GC(9) Pause Full (Ergonomics) 773M->517M(981M) 3052.236ms
[25.388s][info][gc] GC(10) Pause Full (Ergonomics) 773M->517M(981M) 3094.713ms
[29.115s][info][gc] GC(11) Pause Full (Ergonomics) 773M->517M(981M) 3388.381ms
[33.196s][info][gc] GC(12) Pause Full (Ergonomics) 773M->517M(981M) 3735.875ms
# ParallelCompactData::Log2BlockSize = 4
[26.898s][info][gc] GC(19) Pause Full (Ergonomics) 774M->518M(981M) 1166.031ms
[28.191s][info][gc] GC(20) Pause Full (Ergonomics) 774M->518M(981M) 991.396ms
[29.643s][info][gc] GC(21) Pause Full (Ergonomics) 774M->518M(981M) 1152.030ms
[31.055s][info][gc] GC(22) Pause Full (Ergonomics) 774M->518M(981M) 979.091ms
# ParallelCompactData::Log2BlockSize = 1 (minimum)
[28.284s][info][gc] GC(23) Pause Full (Ergonomics) 774M->518M(981M) 880.228ms
[29.501s][info][gc] GC(24) Pause Full (Ergonomics) 774M->518M(981M) 882.738ms
[30.662s][info][gc] GC(25) Pause Full (Ergonomics) 774M->517M(981M) 844.468ms
[31.871s][info][gc] GC(26) Pause Full (Ergonomics) 773M->517M(981M) 873.624ms
```
It probably a design/implementation question: can we do the pointer calculation like Serial, G1 and Shenandoah Full GCs do, go linearly over the whole objects in the block/region/heap and compute all addresses in one swing, instead of going to possibly random objects and trying to figure out new addresses on the spot? I think current scheme is the part of Scissor, but I need to brush up on that. It would probably require storing the fwdptr information somewhere.
If not, can we do smaller block size to limit the impact of live bitmap scans?
If not, can we chunk the array walks in compact to avoid the single-threaded execution like in these examples?
- duplicates
-
JDK-8329203 Parallel: Investigate Mark-Compact for Full GC to decrease memory usage
- Resolved
- relates to
-
JDK-8332969 Improve Parallel Full GC to break even with Serial Full GC sooner
- Open
-
JDK-8310031 Parallel: Implement better work distribution for large object arrays in old gen
- Resolved
-
JDK-8325553 Parallel: Use per-marker cache for marking stats during Full GC
- Resolved
-
JDK-8323791 Parallel: Using non-atomic for live bytes update during Full GC
- Closed