Watchdog causing multiple restarts for mlbridge

    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • 0.9
    • Affects Version/s: 0.9
    • Component/s: bots

      When mlbridge is restarted after its scratch dirs have been cleared out, it gets stuck in a restart loop triggered by the Watchdog. Today I observed a 20-minute restart cycle. The cycle gets longer over time, so we will eventually reach a steady state.

      The Watchdog is hard-coded to restart the Java process if it has not been pinged within 10 minutes of the previous ping. After SKARA-1012, cloning repos takes longer than before, so my guess is that cloning now takes long enough that all executors are stuck for more than 10 minutes at the same time.

      I think we need to make this timeout configurable so it can be adapted for different bot runner configurations.
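
      As a rough illustration of the proposal, here is a minimal sketch of a watchdog whose timeout is passed in rather than hard-coded. It is not the actual Skara implementation: the class name ConfigurableWatchdog, the helper timeoutFromConfig, the ISO-8601 setting format (for example "PT30M") and the injected clock are all assumptions made for this example. Only the 10 minute default comes from the description above.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical sketch; not the actual Skara Watchdog class.
class ConfigurableWatchdog {
    // Default mirrors the 10 minute limit described in this issue.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(10);

    private final Duration timeout;
    private final LongSupplier nanoClock; // injected so the logic can be tested
    private volatile long lastPingNanos;

    ConfigurableWatchdog(Duration timeout, LongSupplier nanoClock) {
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeout = timeout;
        this.nanoClock = nanoClock;
        this.lastPingNanos = nanoClock.getAsLong();
    }

    // Hypothetical helper: parse an ISO-8601 duration such as "PT30M",
    // falling back to the default when the setting is absent.
    static Duration timeoutFromConfig(String value) {
        return value == null ? DEFAULT_TIMEOUT : Duration.parse(value);
    }

    // Called by an executor to signal that work is still progressing.
    void ping() {
        lastPingNanos = nanoClock.getAsLong();
    }

    // True when no ping has arrived within the configured timeout.
    boolean hasExpired() {
        long elapsed = nanoClock.getAsLong() - lastPingNanos;
        return elapsed > timeout.toNanos();
    }
}
```

      In production the clock would typically be System::nanoTime, and a bot runner with slow clones could set a longer timeout than one with small repos. How the value is actually read from the bot runner configuration is left open here.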

            Assignee:
            Erik Joelsson
            Reporter:
            Erik Joelsson
            Votes:
            0
            Watchers:
            3

              Created:
              Updated:
              Resolved: