The new PullRequestPoller from SKARA-1565 is inefficient in combination with GitLab. In the most common case, when no MRs have been updated, you would expect it to only do one GET query and be done. Instead, that initial query will always return the last updated MR, and then continue to fetch all the metadata for it, which adds up to a lot of time.
This was supposed to be mitigated with the "paddingNeeded" boolean field, which tracks when we need to apply the query padding on the updatedAfter parameter. Unfortunately, this isn't enough for GitLabRepository. The merge_requests REST API specifies the "updated_after" parameter like this:
"Return merge requests updated on or after the given time. Expected in ISO 8601 format (2019-03-15T08:00:00Z)."
So the timestamp is inclusive as well as only on 1s resolution. This creates a similar situation like we have in the IssuePoller, where Jira queries have a resolution of 1 minute. In comparison, the GitLabRepository does the timestamp filtering on the client side, using an exclusive ZonedDateTime::isAfter check. What this amounts to is basically this:
GitLabRepository::pullRequestsAfter(myPr.updatedAt()) -> will include myPr
GitHubRepository::pullRequestsAfter(myPr.updatedAt()) -> will not include myPr
I see two possible ways of solving this:
1. Add client side filtering on timestamps in GitLabRepository, to make the semantics the same between GitLabRepository and GitHubRepository.
2. Implement the same kind of padding for the updatedAfter parameter in PullRequestPoller as we already have in IssuePoller. That means, when enough time has passed since we last found something (1s in this case), we add 1s to the updatedAfter parameter.
Alternative 1 is simpler, and having semantic parity between the two implementations would be good. However, it depends on us being able to trust that the internal timestamp resolution on the forges is high enough that two PRs can't end up with the same updatedAt timestamps, while only one is returned from an updatedAfter query at that same time. If we choose to not trust that, then we need to fix 1 by changing GitHubRepository to have an inclusive check, and do 2 as well. I'm choosing to do the latter.
Another inefficiency is the amount of queries a GitLabMergeRequest needs to perform to gather comments and reviews. These are fetched for all MRs by the poller as we need to potentially compare them for updates. In reality, both of these are stored as "notes" by GitLab and if we only need them for comparison purposes, we can just fetch all notes in a single query. I think we can abstract this with a new method on PullRequest that returns an Object that we only use for isUpdated comparisons. Adding a lazy cache for this snapshot will also help, as PullRequestPoller currently needs to call it twice for most PRs.
This was supposed to be mitigated with the "paddingNeeded" boolean field, which tracks when we need to apply the query padding on the updatedAfter parameter. Unfortunately, this isn't enough for GitLabRepository. The merge_requests REST API specifies the "updated_after" parameter like this:
"Return merge requests updated on or after the given time. Expected in ISO 8601 format (2019-03-15T08:00:00Z)."
So the timestamp is inclusive as well as only on 1s resolution. This creates a similar situation like we have in the IssuePoller, where Jira queries have a resolution of 1 minute. In comparison, the GitLabRepository does the timestamp filtering on the client side, using an exclusive ZonedDateTime::isAfter check. What this amounts to is basically this:
GitLabRepository::pullRequestsAfter(myPr.updatedAt()) -> will include myPr
GitHubRepository::pullRequestsAfter(myPr.updatedAt()) -> will not include myPr
I see two possible ways of solving this:
1. Add client side filtering on timestamps in GitLabRepository, to make the semantics the same between GitLabRepository and GitHubRepository.
2. Implement the same kind of padding for the updatedAfter parameter in PullRequestPoller as we already have in IssuePoller. That means, when enough time has passed since we last found something (1s in this case), we add 1s to the updatedAfter parameter.
Alternative 1 is simpler, and having semantic parity between the two implementations would be good. However, it depends on us being able to trust that the internal timestamp resolution on the forges is high enough that two PRs can't end up with the same updatedAt timestamps, while only one is returned from an updatedAfter query at that same time. If we choose to not trust that, then we need to fix 1 by changing GitHubRepository to have an inclusive check, and do 2 as well. I'm choosing to do the latter.
Another inefficiency is the amount of queries a GitLabMergeRequest needs to perform to gather comments and reviews. These are fetched for all MRs by the poller as we need to potentially compare them for updates. In reality, both of these are stored as "notes" by GitLab and if we only need them for comparison purposes, we can just fetch all notes in a single query. I think we can abstract this with a new method on PullRequest that returns an Object that we only use for isUpdated comparisons. Adding a lazy cache for this snapshot will also help, as PullRequestPoller currently needs to call it twice for most PRs.