  1. Skara
  2. SKARA-1042

Watchdog causing multiple restarts for mlbridge


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P3
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.9
    • Component/s: bots
    • Labels:
      None

      Description

      When mlbridge is restarted after its scratch dirs have been cleared out, it gets stuck in a restart loop triggered by the Watchdog. Today I observed a 20-minute restart cycle. The cycle grows longer over time, so we will eventually reach a steady state.

      The Watchdog is hard-coded to restart the Java process if it has not been pinged within 10 minutes of the previous ping. After SKARA-1012, cloning repos takes longer than before, so my guess is that it now takes long enough for all executors to be stuck for more than 10 minutes at the same time.

      I think we need to make this timeout configurable so that it can be adapted to different bot runner configurations.
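
To illustrate the idea, here is a minimal sketch of a watchdog whose timeout is supplied from configuration instead of being fixed at 10 minutes. This is not the actual Skara Watchdog: the class name `WatchdogSketch`, the methods `ping`, `shouldRestart` and `parseTimeout`, and the use of an ISO-8601 duration string such as `PT30M` as the configuration format are all assumptions made for this example. Timestamps are passed in explicitly so the logic can be checked without waiting.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch only; not the real Skara Watchdog class. */
public class WatchdogSketch {
    // Default mirrors the hard-coded 10 minutes described in this issue.
    private static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(10);

    private final Duration maxWait;     // configurable limit between pings
    private volatile Instant lastPing;  // time of the most recent ping

    public WatchdogSketch(Duration maxWait, Instant start) {
        if (maxWait.isNegative() || maxWait.isZero()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.maxWait = maxWait;
        this.lastPing = start;
    }

    /** Called by an executor to signal that the process is still making progress. */
    public void ping(Instant now) {
        lastPing = now;
    }

    /** True if more than maxWait has elapsed since the last ping. */
    public boolean shouldRestart(Instant now) {
        return Duration.between(lastPing, now).compareTo(maxWait) > 0;
    }

    /**
     * Assumed config format: an ISO-8601 duration string such as "PT30M".
     * A missing value falls back to the 10 minute default.
     */
    public static Duration parseTimeout(String configured) {
        return configured == null ? DEFAULT_TIMEOUT : Duration.parse(configured);
    }
}
```

With a 30 minute timeout configured, a bot runner whose executors are all busy cloning for 20 minutes would not be restarted, while the default would still trigger after 10 minutes without a ping.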

              People

              Assignee:
              erikj Erik Joelsson
              Reporter:
              erikj Erik Joelsson
              Votes:
              0
              Watchers:
              3
