The CommitCommentsWorkItem gets scheduled once for each repository by PullRequestBot::getPeriodicItems. It's responsible for querying the repository for new comments on commits, and if any are found, spawn CommitCommandWorkItem for them. It's unfortunately not very straight forward to query forges for new commit comments. For GitHub we have a rather complicated GraphQL that does the trick, but for GitLab, we need to jump through some rather nasty hoops.
The main issue is that the data we get from the "events" query from GitLab contains notes posted to commits, but without the commit hash. We only get the commit "title" to identify which commit was commented on. To solve this, we have a rather elaborate logic that first builds a complete map from title to set of hashes for every commit in the repo. This is done preemptively on a local clone of the repo. Using this map and some clever comparisons, we can figure out which commit each comment belongs to.
I don't have data on how long it takes to build this map for a big repository, like the JDK, but it's likely not trivially short. In addition to this, before using any local clone of a repository, we always run the git fsck check (seeSKARA-1598), which can take around 20s for the JDK repo. For this particular work item, the local clone is only used for building this map. What's worse is that the map is built regardless of what kind of forge we are using, so even for GitHub repos, we still build the map and then throw it away, and so we also run the fsck check for no reason as well.
This problem needs to be addressed in several ways. The biggest issue is not building the map (or creating the local repo clone) unless we need it (i.e. using gitlab). But even in the gitlab case, I think we could be smarter. We could build most of the map once and then keep it around, only updating it with new commits. Only the initial build would need to use a local clone. After that we could just update from the remote.
Ideally I would like to hide the whole concept of having to build such a map for this query behind the HostedRepository::recentCommitComments API call, so that only GitLabMergeRequest would need to know about and worry about it. That may be hard to accomplish however, as we would need a local repository at some point, and knowing where one could be created really is the responsibility of bot specific code.
The main issue is that the data we get from the "events" query from GitLab contains notes posted to commits, but without the commit hash. We only get the commit "title" to identify which commit was commented on. To solve this, we have a rather elaborate logic that first builds a complete map from title to set of hashes for every commit in the repo. This is done preemptively on a local clone of the repo. Using this map and some clever comparisons, we can figure out which commit each comment belongs to.
I don't have data on how long it takes to build this map for a big repository, like the JDK, but it's likely not trivially short. In addition to this, before using any local clone of a repository, we always run the git fsck check (see
This problem needs to be addressed in several ways. The biggest issue is not building the map (or creating the local repo clone) unless we need it (i.e. using gitlab). But even in the gitlab case, I think we could be smarter. We could build most of the map once and then keep it around, only updating it with new commits. Only the initial build would need to use a local clone. After that we could just update from the remote.
Ideally I would like to hide the whole concept of having to build such a map for this query behind the HostedRepository::recentCommitComments API call, so that only GitLabMergeRequest would need to know about and worry about it. That may be hard to accomplish however, as we would need a local repository at some point, and knowing where one could be created really is the responsibility of bot specific code.
- causes
-
SKARA-2491 GitLabRepository::recentCommitComments can miss commits
-
- Resolved
-