gerrit

Go to file

Shawn O. Pearce 2d65d29d80 Add cache for tag advertisements

When a client asks the server for access to a Git repository, the
server has to compute a list of tags to advertise to the client.
It is an expensive computation to determine which tags are reachable
from the branches the client has READ permission on.  For large
repositories with many tags the computation can take a considerable
amount of time, bordering on several seconds per connection. To make
the general case more efficient, introduce a cache called "git_tags".

On a trivial usage of the Linux kernel repository, the average
running time of the VisibleRefFilter when caches were hot was
7195.68 ms.  With this commit, it is a mere 5.07 milliseconds
on a hot cache.  A reduction of 99% of the running time.

The caches performs incremental updates under certain situations,
making common updates relatively painless for waiting clients:

  * Branch fast-forward / change submission:

    When a branch fast-forwards to a new commit, or a change is
    submitted to a branch, all prior tags that are reachable from
    that branch are still reachable.  The cache updates itself
    in-place to this new commit, after performing a fast-forward
    check.  Although a fast-forward check requires walking commit
    history, most fast-forwards are only a few commits ahead of the
    prior position, making the check fast enough to do on demand
    for a client.  Once the cache has been updated, other clients
    will not need to perform the check.

  * Short branch rewinds:

    If a branch rewinds to a prior version (or cuts to a different
    history), and in doing so does not eliminate any tags, the
    cache is updated in-place in real-time.  Since the change did
    not impact any tags, the cache is still valid and can continue
    to be reused.

  * Branch deletion:

    If a branch is deleted, the cache is not updated.  The deleted
    branch is simply ignored in the cache.  If a great number of
    branches are deleted and the cache is wasting memory, site
    administrators should flush the "git_tags" cache and force a
    rebuild.  However, since a branch costs just 1 bit per tag, plus
    the size of the branch name string, it would require deleting
    75,000 branches before its even worth considering flushing the
    cache manually... as 75,000 branches is about 5 MB of storage.

  * Branch creation at another branch tip:

    If a new branch is created and points to the same commit as
    an existing branch, the cache is updated by cloning itself
    and adding the new branch for all tags reachable from the
    source branch.  To keep things thread-safe in memory with
    minimal locking, this type of update requires making a full
    copy of the cache's data and is therefore more expensive than
    the prior update techniques, but is significantly cheaper than
    a full rebuild.

There are some nasty corner cases the cache does not try to
handle. For these we just suffer through a full rebuild:

  * Branch creation not at another branch tip:

    Since the newly created branch does not exactly match another
    branch, the project history must be scanned and computed to
    determine which tags are reachable from which branches. There
    is no optimization available for creating a new branch at
    an existing tag, so those cases also force a rebuild.

  * New tag:

    Since the tag position is not exactly known, which branches can
    reach it is also not known. The project history is scanned to
    rebuild the cache.

Unlike the prior version of VisibleRefFilter, a rebuild of the
cache only walks the history of the project once. This makes a
rebuild slightly faster (5s now vs. 7s before).

Cache updates occur automatically as a result of observed changes in
the Git repository references. Doing these updates live just before
sending the advertisement to a client ensure the cache's reachability
data accurately represents the advertisement the client will receive.
It also ensures any updates made directly to the repository (without
Gerrit being involved) is still properly reflected in the result.

We try to optimize updates for two common cases: change submission
and direct push fast-forwards.  Both attempt to update the cache
immediately after the reference has been updated. This saves clients
from needing to perform the fast-forward check themselves, as the
change submission or receive code already performed the check and
has the results on hand.

There is still room for future improvement:

  Adding a new tag to the cache could be computed more quickly by
  copying the old cache, and doing a walk from all branch heads up
  to the new tag, but stopping at the new tag.  This does not always
  work well when there are disconnected branches in the repository,
  as those unrelated branches will scan to the roots before
  terminating, making the update nearly the cost of full rebuild.

  Adding a new branch not at the exact same position as another
  branch could be computed more quickly by scanning all commits
  in the new branch (to see if tags were added), and all commits
  in existing branches (to see if tags should be excluded), and
  stopping at the merge base of the two sets.  Any tags that are not
  excluded from the other branches but are reachable from the other
  branches should also be reachable from the new branch. However
  this is a tricky concept, as one must consider unrelated branches,
  so this operation may need to be performed once per unique set
  of branches that a tag can reach.  Since this may require more
  than one history traversal, and if the branches are unrelated,
  a traversal all the way to the root, its nearly as expensive as
  the full rebuild option. Since its a lot more complex than a full
  rebuild, I am punting on this and may never implement it.

Change-Id: I519a39ade2742de02003d9dedd17a8edca5c4923

2011-06-24 09:14:55 -07:00