JDK-8227275

Within native OOM error handling, assertions may hang the process


    • b06

        Summary: On OOM, we fail to disarm the assertion poison page; this may lead to endless loops during error handling if assertions fire in native OOM scenarios.

        --

        When an assert happens, we touch a poison page to receive the current ucontext for error analysis. That works like this:

        assert ->
        touch assertion poison page (immediately, in the same frame, with as little code as possible running after evaluating the assert condition) ->
        bang! enter signal handler ->
        in signal handling, copy ucontext ->
        and disable poison page ->
        return from signal handler, which brings us back to the same load that triggered the fault ->
        touch the poison page again; it is disarmed now, so this is a no-op ->
        continue handling the assertion.
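
The round trip above can be sketched as a small standalone program. This is a minimal illustration assuming Linux/POSIX, not the HotSpot implementation; all names (g_poison_page, poison_handler, touch_poison_page_once) are hypothetical. Calling mprotect from a signal handler is not formally guaranteed to be async-signal-safe by POSIX, which the sketch accepts for the sake of illustration.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

// Hypothetical names throughout; this only illustrates the mechanism.
static char* g_poison_page = nullptr;            // armed page (PROT_NONE)
static size_t g_page_size = 0;
static ucontext_t g_saved_context;               // copy taken in the handler
static volatile sig_atomic_t g_handler_calls = 0;

static void poison_handler(int, siginfo_t* info, void* ucv) {
  if (info->si_addr != g_poison_page) {
    _exit(1);  // a genuine crash, not a touch of our poison page
  }
  g_handler_calls = g_handler_calls + 1;
  // Copy the ucontext that describes the asserting frame.
  std::memcpy(&g_saved_context, ucv, sizeof(ucontext_t));
  // Disarm: make the page readable so the retried load succeeds.
  mprotect(g_poison_page, g_page_size, PROT_READ);
}

// Arms a page, touches it once, and returns how often the handler ran
// (or -1 if the page could not be mapped).
static int touch_poison_page_once() {
  g_page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* p = mmap(nullptr, g_page_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return -1;
  g_poison_page = static_cast<char*>(p);
  g_handler_calls = 0;

  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = poison_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  struct sigaction old_segv, old_bus;
  sigaction(SIGSEGV, &sa, &old_segv);
  sigaction(SIGBUS, &sa, &old_bus);

  // The "assert": a load from the armed page. The first attempt faults;
  // after the handler returns, the same load is retried and is harmless.
  volatile char sink = *reinterpret_cast<volatile char*>(g_poison_page);
  (void)sink;

  sigaction(SIGSEGV, &old_segv, nullptr);
  sigaction(SIGBUS, &old_bus, nullptr);
  munmap(g_poison_page, g_page_size);
  return g_handler_calls;
}
```

In the good case the handler runs exactly once per touch, and g_saved_context then holds the register state of the asserting frame.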

        In case of a native OOM, this may fail: the mprotect call used to disarm the poison page may return ENOMEM (this depends on the OS, but can happen e.g. on Linux when switching from PROT_NONE to PROT_READ|PROT_WRITE), leaving the poison page armed.
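
A minimal sketch of a disarm step that reports failure instead of ignoring it, assuming Linux/POSIX. The helper names are hypothetical. A real native OOM is hard to provoke in a small example, so the failure path is exercised here with an unmapped range, for which Linux documents ENOMEM as well; this is a different cause than the one described above, but it goes through the same return-value check.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: tries to make the page readable.
// Returns 0 on success, otherwise the errno value set by mprotect.
static int disarm_page_checked(void* page, size_t size) {
  if (mprotect(page, size, PROT_READ) == 0) return 0;
  return errno;
}

// Exercises both outcomes; returns true if both behaved as expected.
static bool demo_checked_disarm() {
  size_t size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* page = mmap(nullptr, size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) return false;

  int armed_result = disarm_page_checked(page, size);     // expected: 0
  munmap(page, size);
  // The range is unmapped now, so mprotect must reject it
  // (Linux documents ENOMEM for unmapped ranges).
  int unmapped_result = disarm_page_checked(page, size);  // expected: nonzero

  return armed_result == 0 && unmapped_result != 0;
}
```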

        The chance of this happening in a normal assertion scenario (an OOM hitting out of the blue just when we hit an assert and attempt to disarm the poison page) is astronomically small.

        However, this may happen as a result of an OOM elsewhere, which could trigger a follow-up assertion. Then this happens:

        ... OOM! ...
        ...
        assert ->
        touch assert poison page ->
        bang! enter signal handler ->
        in signal handling, copy ucontext ->
        and disable poison page - but that fails! ->
        current code does not care, returns to asserting code, to the same opcode ->
        again touch assert poison. ->
        enter signal handler ->
        repeat...
        ...

        The result is an endless loop. Since it does not consume stack space, it can go on forever, and since we effectively disable signal handling, the error handler timeout does not seem to work either. The process hangs.
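
The re-entry behavior can be reproduced in a bounded form. In this sketch (Linux/POSIX assumed, all names hypothetical) the handler simulates a failed disarm by returning without changing the page protection; the faulting load is then retried and faults again. To keep the example finite, the handler disarms the page for real once it has been entered a given number of times. In the scenario described above there is no such limit.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical names; the bound exists only so the demonstration terminates.
static char* g_page = nullptr;
static size_t g_size = 0;
static volatile sig_atomic_t g_entries = 0;
static volatile sig_atomic_t g_give_up_after = 0;

static void stubborn_handler(int, siginfo_t* info, void*) {
  if (info->si_addr != g_page) _exit(1);    // unrelated crash
  g_entries = g_entries + 1;
  if (g_entries < g_give_up_after) return;  // simulated failed disarm: page stays armed
  mprotect(g_page, g_size, PROT_READ);      // finally disarm so the demo ends
}

// Touches an armed page once and returns how often the handler was entered
// (or -1 if the page could not be mapped).
static int count_retries_until_disarm(int give_up_after) {
  g_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* p = mmap(nullptr, g_size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return -1;
  g_page = static_cast<char*>(p);
  g_entries = 0;
  g_give_up_after = give_up_after;

  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = stubborn_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  struct sigaction old_segv, old_bus;
  sigaction(SIGSEGV, &sa, &old_segv);
  sigaction(SIGBUS, &sa, &old_bus);

  // A single load in the source; it is re-executed after every handler return.
  volatile char sink = *reinterpret_cast<volatile char*>(g_page);
  (void)sink;

  sigaction(SIGSEGV, &old_segv, nullptr);
  sigaction(SIGBUS, &old_bus, nullptr);
  munmap(g_page, g_size);
  return g_entries;
}
```

One load in the source leads to as many handler entries as the bound allows, and no additional stack is consumed between entries, which is why nothing stops the unbounded case.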

        Most native OOM situations in HotSpot are handled cleanly: they are either handled explicitly by the caller or they enter error handling via VMError::report_vm_out_of_memory(). This means that an assertion following a native OOM most likely happens during error handling. This slightly changes the picture above:

        ... OOM! ...
        ...
        assert ->
        touch assert poison page ->
        bang! enter secondary signal handler (crash_handler() in vmError_posix.cpp) ->
        in signal handling, copy ucontext ->
        and disable poison page - but that fails! ->
        current code does not care, returns to asserting code, to the same opcode ->
        again touch assert poison. ->
        enter secondary signal handler (crash_handler() in vmError_posix.cpp) ->
        repeat...
        ...


        One simple fix could be to just switch off the assertion poison page after entering VMError::report_and_die(). We do not need it from that point on, since we do not care (much) about secondary asserts or asserts happening in parallel threads.
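
One possible way to switch the poison page off without relying on mprotect is to redirect the touched address to a harmless byte. The sketch below illustrates that idea under the assumption that asserts touch the page through a global pointer; it is not taken from the actual fix, and all names (g_poison_addr, report_and_die_sketch, and so on) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical: asserts touch whatever this pointer refers to.
static char g_dummy_byte = 0;
static char* volatile g_poison_addr = &g_dummy_byte;

// Arms the mechanism by pointing at a PROT_NONE page.
static bool arm_poison() {
  size_t size = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  void* p = mmap(nullptr, size, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return false;
  g_poison_addr = static_cast<char*>(p);
  return true;
}

// Switching off is a plain pointer store, so it cannot fail with ENOMEM.
static void switch_off_poison() {
  g_poison_addr = &g_dummy_byte;
}

// What an assert would do before reporting.
static void touch_poison() {
  volatile char sink = *static_cast<volatile char*>(g_poison_addr);
  (void)sink;
}

// Stand-in for the start of error reporting: switch off first, so that
// asserts firing later (simulated by the touch) no longer fault.
static bool report_and_die_sketch() {
  switch_off_poison();
  touch_poison();  // would crash here if the page were still the target
  return g_poison_addr == &g_dummy_byte;
}
```

The page itself stays protected in this variant; only the address used by asserts changes.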

        Also, when we fail to disarm the poison page, we should not just return from the signal handler. Since we cannot do much else, we should proceed as if this were a real crash. This will "hide" an assert behind a SIGSEGV and can be confusing if one does not closely examine the call stack, but it is still better than the process hanging.
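
The decision described here can be sketched as a small pure function. The protection call is injected so that the ENOMEM case can be simulated with a stub; the enum, the function names, and the stubs are hypothetical and only illustrate the control flow, not the HotSpot signal handler.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical outcomes of a fault seen by the signal handler.
enum class PoisonOutcome {
  NotOurs,       // fault address is not the poison page
  ResumeAssert,  // page disarmed; return and let the assert continue
  TreatAsCrash   // disarm failed; proceed as if this were a real crash
};

// Same shape as mprotect, so the real call or a stub can be passed in.
using ProtectFn = int (*)(void*, size_t, int);

static PoisonOutcome handle_poison_fault(void* fault_addr, void* poison_page,
                                         size_t size, ProtectFn protect) {
  if (fault_addr != poison_page) return PoisonOutcome::NotOurs;
  if (protect(poison_page, size, PROT_READ) != 0) {
    // Do not return to the faulting load: it would fault again, forever.
    return PoisonOutcome::TreatAsCrash;
  }
  return PoisonOutcome::ResumeAssert;
}

// Stubs standing in for mprotect under OOM and in the normal case.
static int failing_protect(void*, size_t, int) { errno = ENOMEM; return -1; }
static int working_protect(void*, size_t, int) { return 0; }

// Walks through the three cases; returns true if all match.
static bool demo_outcomes() {
  char fake_page = 0;
  char other = 0;
  return handle_poison_fault(&fake_page, &fake_page, 1, failing_protect)
             == PoisonOutcome::TreatAsCrash
      && handle_poison_fault(&fake_page, &fake_page, 1, working_protect)
             == PoisonOutcome::ResumeAssert
      && handle_poison_fault(&other, &fake_page, 1, working_protect)
             == PoisonOutcome::NotOurs;
}
```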

              stuefe Thomas Stuefe