When mlbridge is restarted after the scratch dirs have been cleared out, it gets stuck in a restart loop with the Watchdog. Today I observed a 20 minute restart cycle, but the cycle keeps getting longer over time, so we should eventually reach a steady state.
The Watchdog is hard coded to restart the java process if it hasn't been pinged within 10 minutes of the last ping. After SKARA-1012, cloning repos takes longer than before, so my guess is that it now takes long enough that all executors are stuck for more than 10 minutes at the same time.
I think we need to make this timeout configurable so it can be adapted for different bot runner configurations.
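To illustrate the idea, here is a minimal sketch of a watchdog whose timeout is passed in rather than hard coded. This is not the actual Skara Watchdog code: the class name ConfigurableWatchdog, the parseTimeout helper and the choice of an ISO-8601 duration string (for example "PT30M") as the config format are all assumptions made for illustration. The current time is passed in explicitly to keep the sketch easy to test.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch, not the real Skara Watchdog class.
class ConfigurableWatchdog {
    // Same value as the limit that is hard coded today.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(10);

    private final Duration timeout;
    private Instant lastPing;

    ConfigurableWatchdog(Duration timeout, Instant start) {
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive: " + timeout);
        }
        this.timeout = timeout;
        this.lastPing = start;
    }

    // Assumed config format: an ISO-8601 duration such as "PT30M".
    // A missing or blank value falls back to the default.
    static Duration parseTimeout(String configured) {
        if (configured == null || configured.isBlank()) {
            return DEFAULT_TIMEOUT;
        }
        return Duration.parse(configured.trim());
    }

    // Called by an executor to signal that it is still making progress.
    synchronized void ping(Instant now) {
        lastPing = now;
    }

    // True if more than the configured timeout has passed since the last ping,
    // which is when the java process would be restarted.
    synchronized boolean isExpired(Instant now) {
        return Duration.between(lastPing, now).compareTo(timeout) > 0;
    }
}
```

With something like this, a bot runner configuration with many large repos could set a longer timeout (say 30 minutes) while other configurations keep the 10 minute default.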