Uploaded image for project: 'Skara'
  1. Skara
  2. SKARA-1565

Only poll for updated MRs from GitLab

    XMLWordPrintable

Details

    • Enhancement
    • Resolution: Fixed
    • P3
    • 1.0
    • None
    • bots
    • None

    Description

      One of the biggest culprits for bot performance, at least for bots interacting with GitLab, is that we aren't currently trying to just look at MRs that have changed since last we checked. We are instead always fetching all open MRs and always spawning WorkItems for all of them. This certainly doesn't scale well.

      The reason we do this can be found in a comment here regarding the "updated_at" field in MRs:
      https://github.com/openjdk/skara/blob/c7fc108a271580fa7e6afecf290d5ba0fafeba05/forge/src/main/java/org/openjdk/skara/forge/PullRequestUpdateCache.java#L36

      "GitLab CE does not update this field on events such as adding an award"

      I started to investigate this, but my findings aren't matching what this comment claims. According to comments in this gitlab issue https://gitlab.com/gitlab-org/gitlab/-/issues/21312, it's actually worse. It basically says that "updated_at" fields are capped to update at most once per minute. In my own testing, it also seems like updates that never touched the timestamp will never be reflected in the timestamp. This makes it pretty complex to reliably query for only updated MRs. It's still possible though, with some workarounds, and I would like to implement something that solves this.

      For Github, it seems like we can trust the updated_at field. This means a simple algorithm like this should work fine: First round, query for all PRs. Save the highest updated_at timestamp found in the result for next round. After that, query for all PRs with a higher updated_at than the one found in the previous round.

      For GitLab, we have two complications. The first is that the PR could have had an update up to one minute after the current updated_at value. To mitigate this, we can simply change the query to get all PRs with updated_at higher than 1 minute before the highest one found in the previous round. This will guarantee that we aren't missing any PRs. This may be good enough, and would still be a big improvement, but it would be even better if we could avoid basically guaranteed false positives. To avoid false positives, we can save the PR objects from the last round and compare them to the new query result. We only need to consider the PR updated if there is a difference.

      The second complication is that not all PR state that we care about is in the MR query result. Typically reviews (approvals) and comments (notes) need to be queried separately. This means that to be able to dismiss a PR as a false positive, we also need to fetch the recently updated reviews and comments and compare them for differences. At least with SKARA-1555, we don't need to also fetch award emojis.

      We could potentially reduce the need for false positives checks by keeping track of a bot local timestamp in parallel with each last updated_at from the query results. The local timestamp, together with a boolean flag, could then be used to track when enough time has passed so that we don't need to check for false positives.

      I intend to implement this logic in a common class in the forge module, so that all bots that need to subscribe to PR updates can just simply use this with a single call like 'getUpdatedPullRequests()'. By always querying for updated PRs, there will no longer be a need to periodically check closed PRs instead of just OPEN. Some level of customization will be available as some bots need to consider closed PRs and others don't.

      Attachments

        Issue Links

          Activity

            People

              erikj Erik Joelsson
              erikj Erik Joelsson
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: