Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version: 18
Resolved in Build: b11
CPU: aarch64
Created on behalf of lxw263044@alibaba-inc.com
------
ZGC: Adopt relaxed ordering for self-healing
ZGC uses self-healing in the load barrier to repair references to old objects. Currently, this repair (ZBarrier::self_heal) uses memory_order_conservative to guarantee that (1) the slow path (relocate, mark, etc., where the healed address is obtained) always happens before the self-healing, and (2) any other thread that accesses the same reference can access the content at the healed address.
Let us consider changing the bounded releasing CAS in the forwarding-table installation to an unbounded one. In this way, the release (OrderAccess::release()) serves as a membar that prevents the subsequent stores (the forwarding-table installation and the self-healing) from being reordered before it. More specifically, we can use release() followed by relaxed_cas() in the forwarding-table installation, and relaxed_cas() in the self-healing. In the scenario where a thread reads the healed address from the forwarding table, relaxed_cas() is also sufficient because of the load-acquire on the forwarding table.
Since the release is performed only in the forwarding-table installation, the majority of self-heals, which only remap the pointer, are no longer constrained by the stronger memory order. Performance is visibly improved, as demonstrated by the corretto/heapothesys benchmark on AArch64: the optimized version decreases average concurrent mark time by 15%-20% and average concurrent relocation time by 10%-15%.
baseline
[1000.412s][info ][gc,stats ] Phase: Concurrent Relocate 0.000 / 0.000 161.031 / 232.546 147.641 / 289.855 147.641 / 289.855 ms
[1100.412s][info ][gc,stats ] Phase: Concurrent Relocate 220.739 / 220.739 162.373 / 232.546 149.909 / 289.855 149.909 / 289.855 ms
[1200.411s][info ][gc,stats ] Phase: Concurrent Relocate 0.000 / 0.000 157.900 / 232.546 148.912 / 289.855 148.912 / 289.855 ms
[1000.412s][info ][gc,stats ] Phase: Concurrent Mark 0.000 / 0.000 816.505 / 1508.676 759.533 / 1567.506 759.533 / 1567.506 ms
[1100.412s][info ][gc,stats ] Phase: Concurrent Mark 1262.680 / 1262.680 838.750 / 1508.676 772.495 / 1567.506 772.495 / 1567.506 ms
[1200.411s][info ][gc,stats ] Phase: Concurrent Mark 0.000 / 0.000 813.220 / 1422.825 769.346 / 1567.506 769.346 / 1567.506 ms
optimized
[1000.447s][info ][gc,stats ] Phase: Concurrent Relocate 248.035 / 248.035 162.719 / 253.498 124.061 / 273.227 124.061 / 273.227 ms
[1100.447s][info ][gc,stats ] Phase: Concurrent Relocate 217.434 / 217.434 161.209 / 253.498 126.767 / 273.227 126.767 / 273.227 ms
[1200.447s][info ][gc,stats ] Phase: Concurrent Relocate 0.000 / 0.000 164.023 / 253.498 129.314 / 273.227 129.314 / 273.227 ms
[1000.447s][info ][gc,stats ] Phase: Concurrent Mark 1224.059 / 1224.059 809.838 / 1518.214 582.867 / 1518.214 582.867 / 1518.214 ms
[1100.447s][info ][gc,stats ] Phase: Concurrent Mark 1302.235 / 1302.235 793.589 / 1518.214 600.215 / 1518.214 600.215 / 1518.214 ms
[1200.447s][info ][gc,stats ] Phase: Concurrent Mark 0.000 / 0.000 821.320 / 1518.214 615.187 / 1518.214 615.187 / 1518.214 ms
------
Relates to: JDK-8273122 ZGC: Load forwarding entries without acquire semantics (Open)