When a WorkItem fails with Error (as was recently experienced in SKARA-1825), the item is never removed from the active set in the BotRunner. This causes us to eventually log/ping admins about WorkItems running for too long. We also never try to log the Error, so we aren't notified about the actual problem.
I think we should add a try catch of Error/Throwable at the top level of RunnableWorkItem::run where we log and re-throw the Error.
Handling removal from the active set can be trickier, but we should at least attempt it. The problem is that we have to synchronize access to that collection. Maybe it's enough to just log the Error. Leaving the WorkItem in the active set prevents us from trying to run it again, which may be a good thing if running it is triggering Errors, but it also prevents the bot runner from trying to fully recover. In the current synchronized block we are also attempting to schedule pending tasks. It's probably not a good idea to try to do more work than absolutely necessary in a thread that has just thrown an Error.
I think we should add a try catch of Error/Throwable at the top level of RunnableWorkItem::run where we log and re-throw the Error.
Handling removal from the active set can be trickier, but we should at least attempt it. The problem is that we have to synchronize access to that collection. Maybe it's enough to just log the Error. Leaving the WorkItem in the active set prevents us from trying to run it again, which may be a good thing if running it is triggering Errors, but it also prevents the bot runner from trying to fully recover. In the current synchronized block we are also attempting to schedule pending tasks. It's probably not a good idea to try to do more work than absolutely necessary in a thread that has just thrown an Error.