Type: Bug
Resolution: Fixed
Priority: P2
Affects Version(s): 11, 16, 17
Resolved in Build: b25
A performance regression was observed with a warmup process that does class loading and initialization with CDS enabled (archiving ~10,000 classes). Investigation found the effect was correlated with the runtime Java heap region size and was caused by increased GC overhead, particularly in young-gen GC. The overhead grew more pronounced as the region size increased; with enlarged regions it was significant enough to wipe out all measurable savings from archiving.
G1BlockOffsetTable divides the covered space (the Java heap) into N-word subregions, where N = 2^LogN. It uses an _offset_array to record how far back it must go to find the start of the block that contains the first word of a subregion. Every G1 region (a G1ContiguousSpace) owns a G1BlockOffsetTablePart (associated with part of the _offset_array), which covers the space of that region.
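To illustrate the mechanism, here is a minimal sketch of a block offset table (not the HotSpot implementation; the type name ToyBlockOffsetTablePart, the LogN value of 9, and the flat word-array model are illustrative assumptions, and the real table's multi-card back-skip encoding is ignored):

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model: the covered space is an array of word-sized slots starting
// at 'bottom'; each 2^LogN-word subregion ("card") has one entry recording how
// many words back from the card boundary the block covering that boundary starts.
constexpr size_t LogN = 9;                       // assumption for illustration
constexpr size_t CardWords = size_t(1) << LogN;

struct ToyBlockOffsetTablePart {
  const uintptr_t* bottom;                       // start of the covered space
  std::vector<size_t> offset_array;              // one back-offset (in words) per card

  // Remember that the block starting at 'blk_start' covers the first word of
  // the card containing 'addr'.
  void set_entry(const uintptr_t* addr, const uintptr_t* blk_start) {
    size_t card = size_t(addr - bottom) >> LogN;
    const uintptr_t* boundary = bottom + (card << LogN);
    offset_array[card] = size_t(boundary - blk_start);
  }

  // Fast lookup: jump to the card's entry and step back by the recorded offset.
  // The caller can then walk forward from this block to the one containing 'addr'.
  const uintptr_t* block_start(const uintptr_t* addr) const {
    size_t card = size_t(addr - bottom) >> LogN;
    const uintptr_t* boundary = bottom + (card << LogN);
    return boundary - offset_array[card];
  }
};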
For an open archive heap region that is pre-populated with archived Java objects, its G1BlockOffsetTablePart is never set up at runtime because no allocation is done within the region. As a result, when called for an open archive region at runtime, G1BlockOffsetTablePart::block_start(const void* addr) always looks up from the start (bottom) of the region, regardless of whether the given 'addr' is near the bottom, near the top, or in the middle of the region. The lookup becomes linear in the distance from the region bottom, instead of O(2^LogN). A large heap region size makes the situation worse and young-gen pauses longer.
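A sketch of why the lookup degenerates when no table entries are set up (illustrative only; ToyRegion, the flat object_sizes layout, and linear_block_start are hypothetical names, not HotSpot code):

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical region model: consecutive objects laid out from 'bottom', with
// 'object_sizes[i]' giving the i-th object's size in words.
struct ToyRegion {
  const uintptr_t* bottom;
  std::vector<size_t> object_sizes;
};

// With no offset-table entries, the only known block start is the region
// bottom, so the search walks forward object by object until it reaches the
// one containing 'addr'. The cost is proportional to the distance of 'addr'
// from bottom, so a larger region size means longer walks on every call.
const uintptr_t* linear_block_start(const ToyRegion& r, const uintptr_t* addr) {
  const uintptr_t* cur = r.bottom;
  for (size_t sz : r.object_sizes) {
    if (addr < cur + sz) {
      return cur;          // 'addr' lies inside this object
    }
    cur += sz;
  }
  return cur;              // 'addr' is at or beyond the last object's end
}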
The proposed fix is to populate the G1BlockOffsetTableParts and the associated G1BlockOffsetTable::_offset_array entries for 'open' archive regions at runtime. The fix makes the observed GC overhead go away completely. When running a standalone test that loads and initializes ~10,000 classes with a 5G Java heap and 8M region size and the CDS archive enabled, the CPU cycles reported by 'perf stat' drop from 407,696,156,653 (before) to 116,174,811,999 (after), an ~3.5x improvement (measured on a local Linux machine on JDK 11).
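The core idea of the fix, as a minimal sketch (not the actual HotSpot change; populate_offsets, the object_sizes list, and the 2^9-word card size are assumptions made for illustration): walk the archived objects already laid out in the region once and record the back-offset for every card boundary each object spans, so later block_start() lookups no longer scan from the region bottom.

#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t LogN = 9;                       // assumption for illustration
constexpr size_t CardWords = size_t(1) << LogN;

// Walk the objects already laid out in [bottom, bottom + sum(object_sizes))
// once and record, for every card boundary an object spans, the back-offset
// to that object's start. 'offset_array' must be sized to cover the region.
void populate_offsets(const uintptr_t* bottom,
                      const std::vector<size_t>& object_sizes,  // word sizes, in layout order
                      std::vector<size_t>& offset_array) {
  const uintptr_t* cur = bottom;
  for (size_t sz : object_sizes) {
    const uintptr_t* obj_end = cur + sz;
    // Cards whose boundary falls inside [cur, obj_end) are covered by this object.
    size_t first_card = (size_t(cur - bottom) + CardWords - 1) >> LogN;
    size_t end_card   = (size_t(obj_end - bottom) + CardWords - 1) >> LogN;
    for (size_t card = first_card; card < end_card; ++card) {
      const uintptr_t* boundary = bottom + (card << LogN);
      offset_array[card] = size_t(boundary - cur);   // back-offset to the object's start
    }
    cur = obj_end;
  }
}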
For more details, please see: https://github.com/jianglizhou/OpenJDK-docs/blob/main/Performance%20Impact%20caused%20by%20Large%20G1%20Region%20with%20Open%20Archive%20Heap%20objects.pdf.