- Type: Enhancement
- Resolution: Fixed
- Priority: P4
- Affects Version/s: 17, 21, 25, 26
- Component/s: hotspot
- b27
With https://bugs.openjdk.org/browse/JDK-8336640, we optimized how Shenandoah derives the value of ShenandoahParallelRegionStride, which works fine for non-generational Shenandoah.
For Genshen, the closure passed to the method is usually decorated with a `ShenandoahExcludeRegionClosure` to exclude regions that do not belong to the current generation. In that case there can be a performance regression: because of the filter logic, the GC workers may no longer have the same amount of regions to process.
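For illustration, here is a minimal standalone sketch (hypothetical names and structure, not the actual HotSpot code) of how stride-based chunk claiming combined with a filtering closure can leave some workers with little or no useful work:
```
// Sketch only: stride-based chunk claiming plus a generation filter similar
// in spirit to ShenandoahExcludeRegionClosure. All names are made up.
#include <cstdio>
#include <vector>

struct Region {
  bool in_current_generation;  // regions excluded by the filter need no work
};

int main() {
  const size_t num_regions = 4096;
  const size_t stride      = 1024;  // stand-in for ShenandoahParallelRegionStride
  const size_t num_workers = 4;

  // Assume the current generation's regions are clustered in the first
  // quarter of the heap.
  std::vector<Region> regions(num_regions);
  for (size_t i = 0; i < num_regions; i++) {
    regions[i].in_current_generation = (i < num_regions / 4);
  }

  // Workers claim contiguous chunks of `stride` regions. For simplicity,
  // assume chunk c goes to worker c % num_workers, which approximates
  // workers claiming chunks at similar rates.
  std::vector<size_t> work(num_workers, 0);
  for (size_t start = 0, chunk = 0; start < num_regions; start += stride, chunk++) {
    size_t w = chunk % num_workers;
    for (size_t i = start; i < start + stride && i < num_regions; i++) {
      if (regions[i].in_current_generation) {  // the exclude filter
        work[w]++;                             // stands in for resetting a mark bitmap
      }
    }
  }

  for (size_t w = 0; w < num_workers; w++) {
    printf("worker %zu: %zu regions of real work\n", w, work[w]);
  }
  return 0;
}
```
With a stride of 1024 all of the real work lands on a single worker, whereas a stride of 1 would spread the same 1024 regions across all workers.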
One concrete example is Concurrent Reset/Concurrent Reset After Collect, which need to reset the marking bitmaps for each region of the current generation. When I looked into GC logs from specjbb2015 with a 31G heap, Concurrent Reset After Collect could take more than 60 ms on average:
```
[7119.454s][info][gc,stats ] Concurrent Reset = 7.106 s (a = 9338 us) (n = 761) (lvls, us = 5918, 6270, 6562, 7090, 124989)
[7119.454s][info][gc,stats ] Concurrent Reset After Collect = 50.000 s (a = 66137 us) (n = 756) (lvls, us = 2, 55078, 71875, 79688, 133435)
```
I did a brief verification of the issue by running the DaCapo h2 benchmark as below:
```
java -XX:+TieredCompilation -XX:+AlwaysPreTouch -Xms32G -Xmx32G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -Xlog:gc\* -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -XX:ShenandoahParallelRegionStride=<Stride Value> -jar ~/Downloads/dacapo-23.11-MR2-chopin.jar -n 5 h2 | grep "Concurrent Reset"
```
Here are the results with ShenandoahParallelRegionStride set to powers of 2 from 1 to 4096:
```
[1]
[77.444s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3078 us) (n = 14) (lvls, us = 1172, 1289, 1328, 1406, 14780)
[77.444s][info][gc,stats ] Concurrent Reset After Collect = 0.044 s (a = 3150 us) (n = 14) (lvls, us = 1074, 1504, 1895, 4121, 8952)
[2]
[77.304s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3036 us) (n = 14) (lvls, us = 1152, 1211, 1289, 1328, 14872)
[77.305s][info][gc,stats ] Concurrent Reset After Collect = 0.046 s (a = 3297 us) (n = 14) (lvls, us = 939, 1602, 2148, 3945, 8744)
[4]
[76.898s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3048 us) (n = 14) (lvls, us = 1152, 1230, 1270, 1328, 14989)
[76.898s][info][gc,stats ] Concurrent Reset After Collect = 0.045 s (a = 3215 us) (n = 14) (lvls, us = 1016, 1309, 1914, 3301, 7076)
[8]
[77.916s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3067 us) (n = 14) (lvls, us = 1152, 1211, 1270, 1309, 15091)
[77.916s][info][gc,stats ] Concurrent Reset After Collect = 0.043 s (a = 3050 us) (n = 14) (lvls, us = 1133, 1484, 1934, 3086, 8113)
[16]
[77.071s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3019 us) (n = 14) (lvls, us = 1152, 1250, 1270, 1328, 14615)
[77.071s][info][gc,stats ] Concurrent Reset After Collect = 0.046 s (a = 3284 us) (n = 14) (lvls, us = 932, 1523, 2090, 2930, 8841)
[32]
[76.965s][info][gc,stats ] Concurrent Reset = 0.044 s (a = 3117 us) (n = 14) (lvls, us = 1191, 1211, 1328, 1348, 14768)
[76.965s][info][gc,stats ] Concurrent Reset After Collect = 0.047 s (a = 3323 us) (n = 14) (lvls, us = 930, 1406, 1875, 4316, 8565)
[64]
[77.255s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3033 us) (n = 14) (lvls, us = 1152, 1211, 1270, 1406, 14635)
[77.255s][info][gc,stats ] Concurrent Reset After Collect = 0.054 s (a = 3862 us) (n = 14) (lvls, us = 1133, 1504, 2852, 5508, 8947)
[128]
[76.502s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3027 us) (n = 14) (lvls, us = 1133, 1230, 1250, 1426, 14264)
[76.502s][info][gc,stats ] Concurrent Reset After Collect = 0.053 s (a = 3762 us) (n = 14) (lvls, us = 1172, 1582, 2129, 5273, 9272)
[256]
[76.751s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3057 us) (n = 14) (lvls, us = 1133, 1230, 1270, 1426, 14713)
[76.751s][info][gc,stats ] Concurrent Reset After Collect = 0.056 s (a = 4029 us) (n = 14) (lvls, us = 1484, 1602, 3027, 4629, 11267)
[512]
[77.508s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3082 us) (n = 14) (lvls, us = 1133, 1230, 1270, 1426, 14893)
[77.508s][info][gc,stats ] Concurrent Reset After Collect = 0.068 s (a = 4822 us) (n = 14) (lvls, us = 1953, 2285, 3633, 5605, 16366)
[1024]
[76.933s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3073 us) (n = 14) (lvls, us = 1152, 1211, 1270, 1426, 14957)
[76.933s][info][gc,stats ] Concurrent Reset After Collect = 0.082 s (a = 5877 us) (n = 14) (lvls, us = 1895, 3203, 4258, 7793, 15587)
[2048]
[76.746s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3022 us) (n = 14) (lvls, us = 1133, 1172, 1211, 1406, 14586)
[76.746s][info][gc,stats ] Concurrent Reset After Collect = 0.099 s (a = 7104 us) (n = 14) (lvls, us = 1875, 3281, 4590, 7695, 19292)
[4096]
[77.356s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3031 us) (n = 14) (lvls, us = 1133, 1191, 1250, 1426, 14606)
[77.356s][info][gc,stats ] Concurrent Reset After Collect = 0.101 s (a = 7213 us) (n = 14) (lvls, us = 1914, 3262, 4238, 7871, 19862)
```
As we increase the value of ShenandoahParallelRegionStride, the performance of `Concurrent Reset After Collect` gets worse.
A few thoughts about the problem:
* Concurrent reset may not be a good candidate for parallel_heap_region_iterate, because the task is not really lightweight.
* For Genshen, if the closure is decorated with ShenandoahExcludeRegionClosure, the default ShenandoahParallelRegionStride should not be used, because the filter logic in ShenandoahExcludeRegionClosure may assign an imbalanced workload to the GC worker threads, causing a performance regression (see the sketch after this list).
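As a purely hypothetical illustration of the second point (this is not the fix from the linked commit, and all names are assumptions), the stride choice could take the filtered region count into account and fall back to a small stride when most regions are excluded:
```
// Sketch only: pick a stride based on how many regions the closure will
// actually visit, not the total region count. Names and heuristic are
// illustrative assumptions.
#include <algorithm>
#include <cstdio>

static size_t choose_stride(size_t total_regions,
                            size_t affected_regions,  // regions of the current generation
                            size_t num_workers) {
  // When most regions are filtered out, a large stride lets contiguous runs
  // of excluded regions land on a single worker, so prefer a small stride.
  if (affected_regions < total_regions / 2) {
    return 1;
  }
  // Otherwise keep a coarser stride to limit contention on the claim counter.
  return std::max<size_t>(total_regions / (num_workers * 4), 1);
}

int main() {
  printf("stride = %zu\n", choose_stride(4096, 1024, 4));  // mostly filtered -> 1
  printf("stride = %zu\n", choose_stride(4096, 4096, 4));  // nothing filtered -> 256
  return 0;
}
```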
- links to
  - Commit(master): openjdk/jdk/db2a5420
  - Review(master): openjdk/jdk/28613