jtreg pools agent server JVMs and reuses them for subsequent test actions. An agent server JVM may crash while it sits in the pool, or even while it is being added to the pool.
A recent improvement in jtreg (CODETOOLS-7903894) tried to avoid the situation where a crashed agent server is handed to a subsequent test action, which then fails through no fault of its own because communication between jtreg and the crashed JVM cannot be established. That change added a check that the agent server JVM process is alive (java.lang.Process.isAlive()) before a pooled agent is used for a test action.
It has now been noticed that this check may not be enough. When the agent server JVM crashes, the process takes a while to terminate. During that period it may be writing the hs_err_pid<pid>.log file and performing other post-crash activities. The process is still alive while this is going on, so jtreg can still take this agent from the pool for a subsequent test action. That test action then fails, again through no fault of its own, because it was allocated a JVM that is in the process of crashing.
It would be good to improve the detection of a crashing JVM before a pooled agent is handed to a test action. I think that, in addition to the Process.isAlive() check, it should be possible to check whether a hs_err_pid<agent-pid>.log file is present in the agent server process's working directory. This should help reduce the cases where a test action ends up using a crashing JVM.
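The proposed check could look roughly like the sketch below. It is only an illustration, not jtreg's actual code: the class AgentHealthCheck and its methods are hypothetical names, and it assumes the agent JVM writes its fatal error log under HotSpot's default name (hs_err_pid<pid>.log) in its working directory. An agent started with -XX:ErrorFile pointing elsewhere would not be caught by this check.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of jtreg.
class AgentHealthCheck {

    // Assumed default location of the HotSpot fatal error log for a given pid.
    static Path errorLogFor(Path workDir, long pid) {
        return workDir.resolve("hs_err_pid" + pid + ".log");
    }

    // Core decision: the agent is usable only if its process is alive
    // and no fatal error log for its pid exists in its working directory.
    static boolean isUsable(boolean alive, long pid, Path workDir) {
        if (!alive) {
            return false; // process already terminated
        }
        // A still-alive process with an error log is most likely mid-crash.
        return !Files.exists(errorLogFor(workDir, pid));
    }

    // Convenience overload for a pooled agent's java.lang.Process handle.
    static boolean isUsable(Process process, Path workDir) {
        return isUsable(process.isAlive(), process.pid(), workDir);
    }
}
```

A pool would call isUsable(process, workDir) just before handing an agent out and discard the agent when it returns false. The check narrows the window but does not close it: the JVM can begin crashing before the log file has been created.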