2d65d29d80ace5b05f8ee63c85fab57f607b6777

When a client asks the server for access to a Git repository, the
server has to compute a list of tags to advertise to the client. It
is an expensive computation to determine which tags are reachable
from the branches the client has READ permission on. For large
repositories with many tags the computation can take a considerable
amount of time, approaching several seconds per connection.

To make the general case more efficient, introduce a cache called
"git_tags".

On a trivial usage of the Linux kernel repository, the average
running time of the VisibleRefFilter with hot caches was 7195.68 ms.
With this commit, it is a mere 5.07 ms on a hot cache, a reduction of
more than 99% of the running time.

The cache performs incremental updates in certain situations, making
common updates relatively painless for waiting clients:

* Branch fast-forward / change submission:

  When a branch fast-forwards to a new commit, or a change is
  submitted to a branch, all prior tags that are reachable from that
  branch are still reachable. The cache updates itself in-place to
  this new commit, after performing a fast-forward check. Although a
  fast-forward check requires walking commit history, most
  fast-forwards are only a few commits ahead of the prior position,
  making the check fast enough to do on demand for a client. Once the
  cache has been updated, other clients will not need to perform the
  check.

* Short branch rewinds:

  If a branch rewinds to a prior version (or cuts to a different
  history), and in doing so does not eliminate any tags, the cache is
  updated in-place in real time. Since the change did not impact any
  tags, the cache is still valid and can continue to be reused.

* Branch deletion:

  If a branch is deleted, the cache is not updated. The deleted
  branch is simply ignored in the cache. If a great number of
  branches are deleted and the cache is wasting memory, site
  administrators should flush the "git_tags" cache and force a
  rebuild.
  However, since a branch costs just 1 bit per tag, plus the size of
  the branch name string, it would require deleting 75,000 branches
  before it is even worth considering flushing the cache manually, as
  75,000 branches is about 5 MB of storage.

* Branch creation at another branch tip:

  If a new branch is created and points to the same commit as an
  existing branch, the cache is updated by cloning itself and adding
  the new branch for all tags reachable from the source branch. To
  keep things thread-safe in memory with minimal locking, this type
  of update requires making a full copy of the cache's data and is
  therefore more expensive than the prior update techniques, but is
  significantly cheaper than a full rebuild.

There are some nasty corner cases the cache does not try to handle.
For these we just suffer through a full rebuild:

* Branch creation not at another branch tip:

  Since the newly created branch does not exactly match another
  branch, the project history must be scanned and computed to
  determine which tags are reachable from which branches. There is no
  optimization available for creating a new branch at an existing
  tag, so those cases also force a rebuild.

* New tag:

  Since the tag position is not exactly known, which branches can
  reach it is also not known. The project history is scanned to
  rebuild the cache.

Unlike the prior version of VisibleRefFilter, a rebuild of the cache
only walks the history of the project once. This makes a rebuild
slightly faster (5 s now vs. 7 s before).

Cache updates occur automatically as a result of observed changes in
the Git repository references. Doing these updates live just before
sending the advertisement to a client ensures the cache's
reachability data accurately represents the advertisement the client
will receive. It also ensures any updates made directly to the
repository (without Gerrit being involved) are still properly
reflected in the result.
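
The bit-per-branch layout and the "branch creation at another branch
tip" update described above can be illustrated with a minimal sketch.
Everything here is hypothetical: the class TagReachability and its
methods are names invented for illustration, not Gerrit's actual
classes, and a plain map of branch name to reachable tags stands in
for the result of a real history walk.

```java
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical, simplified model of a tag reachability cache. */
final class TagReachability {
  // Branch name -> bit position. Slots are never reclaimed, so a
  // deleted branch just leaves an unused bit behind.
  private final Map<String, Integer> branchIndex;
  // Tag name -> one bit per branch that can reach the tag.
  private final Map<String, BitSet> reachableFrom;

  private TagReachability(Map<String, Integer> idx, Map<String, BitSet> tags) {
    this.branchIndex = idx;
    this.reachableFrom = tags;
  }

  /** Stand-in for a full rebuild: input maps each branch to its reachable tags. */
  static TagReachability fromFullScan(Map<String, List<String>> tagsByBranch) {
    Map<String, Integer> idx = new HashMap<>();
    Map<String, BitSet> tags = new HashMap<>();
    for (Map.Entry<String, List<String>> e : tagsByBranch.entrySet()) {
      int bit = idx.size();
      idx.put(e.getKey(), bit);
      for (String tag : e.getValue()) {
        tags.computeIfAbsent(tag, t -> new BitSet()).set(bit);
      }
    }
    return new TagReachability(idx, tags);
  }

  /** Copy-on-write update: the new branch inherits every tag the source reaches. */
  TagReachability withBranchAtTipOf(String source, String created) {
    Integer src = branchIndex.get(source);
    if (src == null || branchIndex.containsKey(created)) {
      throw new IllegalArgumentException("unknown source or duplicate branch");
    }
    Map<String, Integer> idx = new HashMap<>(branchIndex);
    int dst = idx.size();
    idx.put(created, dst);
    Map<String, BitSet> tags = new HashMap<>();
    for (Map.Entry<String, BitSet> e : reachableFrom.entrySet()) {
      BitSet bits = (BitSet) e.getValue().clone(); // full copy, readers keep the old one
      if (bits.get(src)) {
        bits.set(dst);
      }
      tags.put(e.getKey(), bits);
    }
    return new TagReachability(idx, tags);
  }

  /** Tags reachable from at least one readable branch; unknown branches are ignored. */
  Set<String> visibleTags(Collection<String> readableBranches) {
    BitSet mask = new BitSet();
    for (String branch : readableBranches) {
      Integer bit = branchIndex.get(branch);
      if (bit != null) {
        mask.set(bit);
      }
    }
    Set<String> visible = new TreeSet<>();
    for (Map.Entry<String, BitSet> e : reachableFrom.entrySet()) {
      if (e.getValue().intersects(mask)) {
        visible.add(e.getKey());
      }
    }
    return visible;
  }
}
```

In this model a fast-forward needs no bit changes at all, since every
tag the branch reached before is still reachable. Returning a new
instance from withBranchAtTipOf mirrors the full-copy update described
above: readers holding the old instance are unaffected.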

We try to optimize updates for two common cases: change submission
and direct push fast-forwards. Both attempt to update the cache
immediately after the reference has been updated. This saves clients
from needing to perform the fast-forward check themselves, as the
change submission or receive code already performed the check and has
the results on hand.

There is still room for future improvement:

* Adding a new tag to the cache could be computed more quickly by
  copying the old cache, and doing a walk from all branch heads up to
  the new tag, but stopping at the new tag. This does not always work
  well when there are disconnected branches in the repository, as
  those unrelated branches will scan to the roots before terminating,
  making the update nearly the cost of a full rebuild.

* Adding a new branch not at the exact same position as another
  branch could be computed more quickly by scanning all commits in
  the new branch (to see if tags were added), and all commits in
  existing branches (to see if tags should be excluded), and stopping
  at the merge base of the two sets. Any tags that are not excluded
  from the other branches but are reachable from the other branches
  should also be reachable from the new branch. However, this is a
  tricky concept, as one must consider unrelated branches, so this
  operation may need to be performed once per unique set of branches
  that a tag can reach. Since this may require more than one history
  traversal, and if the branches are unrelated, a traversal all the
  way to the root, it is nearly as expensive as the full rebuild
  option. Since it is a lot more complex than a full rebuild, I am
  punting on this and may never implement it.

Change-Id: I519a39ade2742de02003d9dedd17a8edca5c4923