Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8174439 | 10 | Thomas Stuefe | P3 | Resolved | Fixed | b01 |
On AIX, the VM handles SIGDANGER by writing an error log and aborting. This is wrong for multiple reasons.
On AIX, the OS handles out-of-memory (OOM) situations as follows:
1) When paging space runs low, SIGDANGER is sent to all processes. The default action is to ignore this signal.
2) If the low paging space situation persists, processes are killed according to a ranking system, but processes with a SIGDANGER handler installed are spared.
The intent is that a process may use SIGDANGER to monitor for unstable situations and act accordingly, e.g. save valuable work.
By treating SIGDANGER like an assert, we are wrong in every situation:
- If the VM handling the signal is the culprit (using large amounts of memory), writing an hs-err file does not help, because hs-err file reporting itself takes time and paging space. We often see situations where the hs-err timeout (ErrorLogTimeout) kicks in and leaves us with a half-written log file. By the time the timeout fires, the OS has already started killing other, innocent processes, when the reasonable action would have been to kill just us. We were spared because we were busy handling SIGDANGER.
- If the VM handling the signal is not the culprit, it should ignore SIGDANGER, not eagerly abort.
So the best way would be to ignore SIGDANGER and not interfere with the OOM killing heuristics the OS is using.
For more details, see IBM Redbook "IBM Power Systems Performance Guide: Implementing and Optimizing", section about npswarn/npskill parameters.
JDK-8174439 [aix] AIX VM should not handle SIGDANGER