Details
- Type: Bug
- Resolution: Fixed
- Priority: P3
- Affects Version/s: 11, 15, 17, 19, 20
- Resolved In Build: b18
Backports
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8298840 | 17.0.7-oracle | Tobias Hartmann | P3 | Resolved | Fixed | b01 |
JDK-8299477 | 17.0.7 | Goetz Lindenmaier | P3 | Resolved | Fixed | b01 |
JDK-8299142 | 11.0.19-oracle | Igor Veresov | P3 | Resolved | Fixed | b01 |
Description
In ZGC we are very sensitive to accesses and their GC barriers being separated by safepoints. We never managed to tame the sea of nodes well enough to guarantee that this cannot happen, and hence elected to expand the barriers as late as humanly possible, right before assembly.
For some reason it has been believed that, despite this being a huge problem for ZGC, it is not at all a problem for SATB collectors such as G1 and Shenandoah. But they face the same issue. Consider the Reference.get() intrinsic. It loads the referent and enqueues it into the thread-local SATB buffer. The load of the referent and the store that publishes it to the thread-local SATB buffer must *not* be separated by any safepoints.
At safepoint polls, we can also deoptimize. Most of the time G1 is fine, because concurrent marking terminates with a remark pause at a safepoint: if remark finds something on the stack that is not yet marked, it marks it. That is probably why we do not see any crashes. However, consider a Java call placed between the referent load and the SATB barrier. The nmethod may then get deoptimized when the callee returns. In such a scenario, nobody will have marked the referent, and when we roll into the interpreter, it may store the referent into some field and return. Now the object graph has been corrupted with an object that will never get marked.
There are similar issues involving stack walkers that can capture the local holding the referent before it has been SATB-enqueued, after which the nmethod deoptimizes.
Not your everyday race, but it is possible in theory, and it is quite annoying.
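The deoptimization scenario above can be illustrated with a toy model. This is only a sketch under stated assumptions, not HotSpot code: `Obj`, `ToyMarker`, and `referent_survives` are hypothetical names, marking is reduced to a set plus a queue, and the "deoptimization" is simulated by simply skipping the enqueue that compiled code would have executed after the call.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Toy heap object: one reference field is enough for the illustration.
struct Obj {
    Obj* field = nullptr;
};

// Toy SATB marker (hypothetical): a marked set plus a queue of objects
// that the mutator has reported through its SATB barrier.
struct ToyMarker {
    std::unordered_set<const Obj*> marked;
    std::vector<Obj*> satb_queue;

    void enqueue(Obj* o) {
        if (o != nullptr) satb_queue.push_back(o);
    }

    // Drain the queue and mark transitively from the queued objects.
    void drain() {
        while (!satb_queue.empty()) {
            Obj* o = satb_queue.back();
            satb_queue.pop_back();
            if (!marked.insert(o).second) continue;  // already marked
            enqueue(o->field);
        }
    }
};

// Returns true if the referent is marked at the end of the toy cycle.
// enqueue_before_deopt == true models a barrier that runs right after the
// load; false models a barrier scheduled after a call at which the
// compiled code is deoptimized, so the barrier never executes.
bool referent_survives(bool enqueue_before_deopt) {
    Obj holder, referent;
    ToyMarker gc;
    gc.marked.insert(&holder);  // holder was already visited by marking

    Obj* loaded = &referent;    // stand-in for the referent load
    if (enqueue_before_deopt) {
        gc.enqueue(loaded);     // SATB barrier directly after the load
    }
    // Simulated call and deoptimization point: execution continues in the
    // interpreter, which stores the referent into a field and returns.
    holder.field = loaded;

    gc.drain();                 // holder is marked and is not rescanned
    return gc.marked.count(&referent) != 0;
}
```

In this model, `referent_survives(true)` reports the referent as marked, while `referent_survives(false)` leaves it reachable from `holder` but unmarked, which is the corruption described above.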
This similarly applies to the "is marking active" load expanded in the pre-write store barriers for G1.
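A minimal sketch of the same hazard for the pre-write barrier, assuming the "is marking active" flag can flip at a safepoint between the flag load and the rest of the barrier. The names `ToyThread`, `MarkingStart`, and `old_value_enqueued` are hypothetical and the barrier is heavily simplified compared to the real G1 one.

```cpp
#include <cassert>
#include <vector>

struct Obj {
    Obj* field = nullptr;
};

// Hypothetical per-thread SATB state.
struct ToyThread {
    bool marking_active = false;
    std::vector<Obj*> satb_queue;
};

// Where marking becomes active relative to the barrier's flag load.
enum class MarkingStart { BeforeFlagLoad, BetweenFlagLoadAndStore };

// Performs holder.field = new_value with a simplified SATB pre-write
// barrier and reports whether the overwritten value was enqueued.
bool old_value_enqueued(MarkingStart when) {
    Obj old_value, new_value, holder;
    holder.field = &old_value;
    ToyThread t;

    if (when == MarkingStart::BeforeFlagLoad) {
        t.marking_active = true;
    }

    const bool active = t.marking_active;  // the "is marking active" load

    if (when == MarkingStart::BetweenFlagLoadAndStore) {
        // Simulated safepoint: marking starts here, so the value loaded
        // above is stale for the remainder of the barrier.
        t.marking_active = true;
    }

    if (active && holder.field != nullptr) {
        t.satb_queue.push_back(holder.field);  // record the previous value
    }
    holder.field = &new_value;                 // the actual store

    return !t.satb_queue.empty() && t.satb_queue.front() == &old_value;
}
```

With `MarkingStart::BeforeFlagLoad` the previous value is enqueued. With `MarkingStart::BetweenFlagLoadAndStore` the barrier acts on the stale flag and the previous value is overwritten without being recorded.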
I have written some verification code for G1 (only works without compressed oops for now):
http://cr.openjdk.java.net/~eosterlund/to_kim/webrev.00/
This patch tags each referent load and its SATB store with the same unique number. Before generating machine code, it matches each load with its store on the mach nodes, traverses the store's block and its dominators up to the load's block, and asserts that there are no safepoints in these blocks. With SPECjbb2015 the assertion fails after a few seconds, which hints that this scary stuff is probably happening for real, and has been broken since forever.
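The dominator walk described above could look roughly like the following. This is a simplified sketch, not the code from the webrev: `Block` and `first_safepoint_between` are hypothetical, the check works at block granularity only, and the load and store are assumed to be already matched by their shared tag.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for a basic block in the final CFG.
struct Block {
    int idom;            // index of the immediate dominator, -1 for the root
    bool has_safepoint;  // true if the block contains a safepoint
};

// Walks from the block holding the tagged SATB store up the dominator
// chain to the block holding the matching referent load. Returns the index
// of the first block on that path that contains a safepoint, or -1 if the
// path is clean. Also returns -1 if the load block is not on the store's
// dominator chain, which this toy does not treat as an error.
int first_safepoint_between(const std::vector<Block>& blocks,
                            int load_block, int store_block) {
    for (int b = store_block; b != -1; b = blocks[b].idom) {
        if (blocks[b].has_safepoint) {
            return b;    // the verification would assert here
        }
        if (b == load_block) {
            return -1;   // reached the load without seeing a safepoint
        }
    }
    return -1;
}
```

For a chain of blocks 0 to 3 where the load sits in block 1, the store in block 3, and block 2 contains a safepoint, the function returns 2, corresponding to a failing assertion in the verification pass.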
Issue Links
- backported by
  - JDK-8298840 C2 SATB barriers are not safepoint-safe (Resolved)
  - JDK-8299142 C2 SATB barriers are not safepoint-safe (Resolved)
  - JDK-8299477 C2 SATB barriers are not safepoint-safe (Resolved)
- relates to
  - JDK-8295066 Folding of loads is broken in C2 after JDK-8242115 (Resolved)
- links to
  - Commit openjdk/jdk17u-dev/abfa08fb
  - Commit openjdk/jdk/c6e3daa5
  - Review openjdk/jdk17u-dev/999
  - Review openjdk/jdk/10517