Enhancement - Resolution: Fixed - P4 - 20 - b17
We have this code in our signal handler:
```
#ifndef AMD64
  // Halt if SI_KERNEL before more crashes get misdiagnosed as Java bugs
  // This can happen in any running code (currently more frequently in
  // interpreter code but has been seen in compiled code)
  if (sig == SIGSEGV && info->si_addr == 0 && info->si_code == SI_KERNEL) {
    fatal("An irrecoverable SI_KERNEL SIGSEGV has occurred due "
          "to unstable signal handling in this distribution.");
  }
#endif // AMD64
```
This bug added that change:
https://bugs.openjdk.java.net/browse/JDK-8004124
In the Generational ZGC we hit the exact same condition whenever we try to (incorrectly) dereference one of our colored pointers. From the bug above:
"A segmentation violation that occurs as a result of userspace process accessing virtual memory above the TASK_SIZE limit will cause a segmentation violation with an si_code of SI_KERNEL"
That is, if a pointer has high-order bits set (placing the address past the TASK_SIZE limit), dereferencing it raises this kind of SIGSEGV.
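To make the TASK_SIZE point concrete, here is a minimal sketch of the address arithmetic. The limit used is an assumption (x86-64 Linux with 4-level paging and 4 KiB pages), and `kHypotheticalColorBit` is an illustrative bit position only, not the actual Generational ZGC pointer layout.

```cpp
#include <cstdint>

static_assert(sizeof(std::uintptr_t) == 8, "sketch assumes a 64-bit platform");

// Assumption: user-space limit on x86-64 Linux with 4-level paging and
// 4 KiB pages. Other architectures and paging modes use different values.
constexpr std::uintptr_t kAssumedTaskSize = (std::uintptr_t{1} << 47) - 4096;

// Hypothetical metadata bit in the high-order part of a pointer.
// This is NOT the real ZGC color layout; it only illustrates the effect.
constexpr std::uintptr_t kHypotheticalColorBit = std::uintptr_t{1} << 62;

// True if a user-space access to addr would fall at or above the assumed limit.
constexpr bool is_above_task_size(std::uintptr_t addr) {
  return addr >= kAssumedTaskSize;
}

// Returns addr with the hypothetical color bit set.
constexpr std::uintptr_t add_color(std::uintptr_t addr) {
  return addr | kHypotheticalColorBit;
}
```

An ordinary user-space address such as `0x00007f0000001000` is below the assumed limit, while the same address with the hypothetical color bit set is above it, which is the situation the quoted kernel behavior describes.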
As the signal handling code is written today, we don't "stop" this signal, and instead try to handle it as an implicit null check. This causes hard-to-debug error messages and crashes in code that incorrectly tries to deoptimize the faulty code.
I propose that we short-cut the signal handling code, and let this problematic SIGSEGV get passed to VMError::report_and_die.
We've been running with this patch in the Generational ZGC repository, without any problems.
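The proposed short-cut can be sketched as a small classification step that runs before any implicit-null-check handling. This is only an illustration of the idea, not the actual patch: `SegvAction`, `is_si_kernel_segv`, and `classify_signal` are hypothetical names, and the fault address is passed as a plain integer so that the logic can be checked in isolation.

```cpp
#include <signal.h>   // SIGSEGV, SI_KERNEL, SEGV_MAPERR (Linux)
#include <cstdint>

// Hypothetical outcome of the early check.
enum class SegvAction {
  ReportAndDie,            // skip further handling, go to error reporting
  ContinueNormalHandling   // let the regular handler logic run
};

// Same condition as the existing non-AMD64 check: a SIGSEGV with a zero
// fault address whose si_code says it was raised by the kernel.
constexpr bool is_si_kernel_segv(int sig, std::uintptr_t fault_addr, int si_code) {
  return sig == SIGSEGV && fault_addr == 0 && si_code == SI_KERNEL;
}

// Decide up front whether the signal should bypass implicit null check handling.
constexpr SegvAction classify_signal(int sig, std::uintptr_t fault_addr, int si_code) {
  return is_si_kernel_segv(sig, fault_addr, si_code)
             ? SegvAction::ReportAndDie
             : SegvAction::ContinueNormalHandling;
}
```

A genuine null dereference typically arrives with `si_code == SEGV_MAPERR` rather than `SI_KERNEL`, so under this sketch it would still take the normal path, while the SI_KERNEL case would go straight to error reporting.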