SKARA-1042

Watchdog causing multiple restarts for mlbridge

    • Type: Bug
    • Resolution: Fixed
    • Priority: P3
    • 0.9
    • 0.9
    • bots
    • None

      When mlbridge is restarted after its scratch dirs have been cleared out, it enters a restart loop with the Watchdog. Today I observed a 20-minute restart cycle. The cycle keeps getting longer over time, so we will eventually reach a steady state.

      The Watchdog is hard-coded to restart the Java process if it has not been pinged within 10 minutes of the previous ping. After SKARA-1012, cloning repos takes longer than before, so my guess is that cloning now takes long enough that all executors are stuck for more than 10 minutes at the same time.

      I think we need to make this timeout configurable so it can be adapted for different bot runner configurations.
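
      A minimal sketch of what a configurable timeout could look like, for illustration only. This is not the actual Skara Watchdog: the class name ConfigurableWatchdog, the parseTimeout helper, and the ISO-8601 duration string format (for example "PT30M") are all assumptions. Only the 10 minute default comes from the description above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch, not the real Skara Watchdog class.
final class ConfigurableWatchdog {
    // Default taken from the issue description: 10 minutes between pings.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(10);

    private final Duration timeout;
    private Instant lastPing;

    ConfigurableWatchdog(Duration timeout, Instant start) {
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive: " + timeout);
        }
        this.timeout = timeout;
        this.lastPing = start;
    }

    // Assumed config format: an ISO-8601 duration string such as "PT30M".
    // A missing value falls back to the 10 minute default.
    static Duration parseTimeout(String configured) {
        return configured == null ? DEFAULT_TIMEOUT : Duration.parse(configured);
    }

    // Called by an executor whenever it makes progress.
    synchronized void ping(Instant now) {
        lastPing = now;
    }

    // True when more than the configured timeout has passed since the last ping.
    synchronized boolean shouldRestart(Instant now) {
        return Duration.between(lastPing, now).compareTo(timeout) > 0;
    }
}
```

      With a setup like this, a bot runner that clones large repos could configure a longer timeout (say 30 minutes) while other runners keep the default. Passing the current time in explicitly is a choice made here to keep the sketch easy to test.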

            erikj Erik Joelsson
