Type: Bug
Resolution: Fixed
Priority: P3
Affects Version/s: 10, 11, 12, 13, 14
Resolved In Build: b06
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8261602 | 13.0.7 | Ekaterina Vergizova | P3 | Resolved | Fixed | b02 |
JDK-8227571 | 13 | Thomas Stuefe | P3 | Closed | Won't Fix | |
JDK-8257849 | 11.0.11-oracle | Dukebot | P3 | Resolved | Fixed | b01 |
JDK-8257399 | 11.0.10 | Thomas Stuefe | P3 | Resolved | Fixed | b05 |
--
When an assert fires, we touch a poison page to obtain the current ucontext for error analysis. It works like this:
assert ->
touch assertion poison page (immediately, in the same frame, with as little code as possible running after evaluating the assert condition) ->
bang! enter signal handler ->
in signal handling, copy ucontext ->
and disable poison page ->
return from signal handler, brings us to the same load which triggered the original crash ->
repeat touching the poison page. It is disarmed now, so this is a no-op ->
continue handling the assertion.
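The sequence above can be sketched on a POSIX system roughly as follows. This is a minimal illustration, not HotSpot's implementation: every name in it (`g_poison_page`, `poison_handler`, `arm_poison_page`, `touch_poison_page_and_count_faults`) is hypothetical, the ucontext copy is a shallow `memcpy`, and the handler ignores the result of `mprotect`, mirroring the behaviour this report describes as problematic.

```cpp
#include <cassert>
#include <csignal>
#include <cstring>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

// All names below are hypothetical; this is not HotSpot code.
static char* g_poison_page = nullptr;
static size_t g_page_size = 0;
static ucontext_t g_saved_context;               // shallow copy of the faulting context
static volatile sig_atomic_t g_context_saved = 0;
static volatile sig_atomic_t g_faults = 0;

static void poison_handler(int sig, siginfo_t* info, void* ucv) {
  char* addr = static_cast<char*>(info->si_addr);
  if (addr >= g_poison_page && addr < g_poison_page + g_page_size) {
    g_faults = g_faults + 1;
    // Copy the ucontext for later error analysis.
    std::memcpy(&g_saved_context, ucv, sizeof(ucontext_t));
    g_context_saved = 1;
    // Disarm the page. The return value is ignored here, which is exactly
    // the weakness discussed in this report.
    mprotect(g_poison_page, g_page_size, PROT_READ | PROT_WRITE);
    return;  // resumes at the faulting load, which now succeeds
  }
  // Not our page: restore the default action so the fault terminates the process.
  signal(sig, SIG_DFL);
}

static bool arm_poison_page() {
  g_page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* p = mmap(nullptr, g_page_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return false;
  g_poison_page = static_cast<char*>(p);

  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = poison_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  // Linux reports the access as SIGSEGV; some other systems use SIGBUS.
  return sigaction(SIGSEGV, &sa, nullptr) == 0 &&
         sigaction(SIGBUS, &sa, nullptr) == 0;
}

// Touches the poison page twice and returns how many faults were taken,
// or -1 if the page could not be set up.
static int touch_poison_page_and_count_faults() {
  if (!arm_poison_page()) return -1;
  assert(g_poison_page != nullptr);
  volatile char* p = g_poison_page;
  char first = *p;   // faults; handler copies the context and disarms the page
  char second = *p;  // page is disarmed now, so this load is a no-op
  (void)first;
  (void)second;
  return g_faults;
}
```

When the disarm succeeds, calling `touch_poison_page_and_count_faults()` takes a single fault and leaves `g_context_saved` set; the second touch does not enter the handler.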
In case of a native OOM, this may fail: the mprotect call used to disarm the poison page may fail with ENOMEM (this depends on the OS, but can happen e.g. on Linux when switching from PROT_NONE to PROT_READ|PROT_WRITE), leaving the poison page armed.
The chance of this happening in a normal assertion scenario (an OOM hitting out of the blue just when we hit an assert and attempt to disarm the poison page) is astronomically small.
However, this may happen as a result of an OOM elsewhere, which could trigger a follow-up assertion. Then this happens:
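To make that failure visible, the disarm step can report the result of `mprotect` instead of discarding it. The helpers below are a hedged sketch with hypothetical names (`disarm_page_checked`, `map_armed_page`); they show only how the error would be detected, not how HotSpot handles it.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: returns 0 on success, otherwise the errno value
// reported by mprotect (for example ENOMEM under native memory exhaustion).
static int disarm_page_checked(void* page, size_t size) {
  if (mprotect(page, size, PROT_READ | PROT_WRITE) == 0) {
    return 0;
  }
  return errno;
}

// Hypothetical helper: maps one inaccessible (armed) page and reports its size.
// Returns nullptr if the mapping fails.
static void* map_armed_page(size_t* size_out) {
  assert(size_out != nullptr);
  size_t size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* p = mmap(nullptr, size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  *size_out = size;
  return p == MAP_FAILED ? nullptr : p;
}
```

A caller that sees a non-zero result knows the page is still armed and that returning to the faulting instruction would fault again.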
... OOM! ...
...
assert ->
touch assert poison page ->
bang! enter signal handler ->
in signal handling, copy ucontext ->
and disable poison page - but that fails! ->
current code does not care, returns to asserting code, to the same opcode ->
again touch assert poison. ->
enter signal handler ->
repeat...
...
This is an endless loop: since it does not consume stack space it can go on forever, and since we effectively disable signal handling, the error handler timeout does not seem to work either. The process hangs.
Most native OOM situations in HotSpot are handled cleanly: they are either handled explicitly by the caller or they enter error handling via VMError::report_vm_out_of_memory(). This means that an assertion following a native OOM most likely happens during error handling. This slightly changes the picture above:
... OOM! ...
...
assert ->
touch assert poison page ->
bang! enter secondary signal handler (crash_handler() in vmError_posix.cpp) ->
in signal handling, copy ucontext ->
and disable poison page - but that fails! ->
current code does not care, returns to asserting code, to the same opcode ->
again touch assert poison. ->
enter secondary signal handler (crash_handler() in vmError_posix.cpp) ->
repeat...
...
One simple fix could be to just switch off the assertion poison page after entering VMError::report_and_die(). We do not need it from that point on, since we do not care (much) about secondary asserts or asserts happening in parallel threads.
Also, when we fail to disarm the poison page, we should not just return from the signal handler. Since we cannot do much else, we should proceed as if this were a real crash. This will "hide" an assert behind a SIGSEGV and can be confusing if one does not closely examine the call stack, but it is still better than the process hanging.
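The effect of both proposals can be illustrated with a small state model. It uses no real signals or memory protection, and all names (`AssertPoisonModel`, `fail_assert`, `AssertResult`) are hypothetical; the fault loop is capped at `max_faults` so that the hang is observable as a result rather than an actual hang.

```cpp
#include <cassert>

// Hypothetical model, not HotSpot code.
enum class AssertResult { Reported, ReportedAsCrash, Hung };

struct AssertPoisonModel {
  bool touch_enabled = true;    // asserts touch the poison page only while set
  bool page_armed = true;       // page is still inaccessible
  bool disarm_succeeds = true;  // false models mprotect failing with ENOMEM

  // Proposal 1: stop using the poison page once fatal error reporting starts.
  void enter_error_reporting() { touch_enabled = false; }
};

// Models one failing assert. With check_disarm_result == false the handler
// returns to the faulting instruction even if disarming failed (the behaviour
// described in this report); with true it applies proposal 2.
static AssertResult fail_assert(AssertPoisonModel& m, bool check_disarm_result,
                                int max_faults = 1000) {
  assert(max_faults > 0);
  if (!m.touch_enabled) {
    return AssertResult::Reported;  // no poison page touch, no signal
  }
  for (int faults = 0; faults < max_faults; ++faults) {
    if (!m.page_armed) {
      return AssertResult::Reported;  // touch is a no-op, assert handling continues
    }
    // Fault: the handler runs and tries to disarm the page.
    if (m.disarm_succeeds) {
      m.page_armed = false;
      continue;  // return to the same load, which now succeeds
    }
    if (check_disarm_result) {
      return AssertResult::ReportedAsCrash;  // proposal 2: treat as a real crash
    }
    // Otherwise: return to the same load and fault again.
  }
  return AssertResult::Hung;  // fault budget exhausted; stands in for the endless loop
}
```

In this model a failed disarm yields `Hung` under the old behaviour, `ReportedAsCrash` when the handler checks the disarm result, and `Reported` when `enter_error_reporting()` was called before the assert.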
- backported by
  - JDK-8257399 Within native OOM error handling, assertions may hang the process (Resolved)
  - JDK-8257849 Within native OOM error handling, assertions may hang the process (Resolved)
  - JDK-8261602 Within native OOM error handling, assertions may hang the process (Resolved)
  - JDK-8227571 Within native OOM error handling, assertions may hang the process (Closed)
- relates to
  - JDK-8216982 Assertion poison page established too early (Resolved)
  - JDK-8225703 crash_handler code makes safepoint polling threads look like they crashed (Closed)
  - JDK-8191101 Show register content in hs-err file on assert (Resolved)