Add cache for tag advertisements
When a client asks the server for access to a Git repository, the
server has to compute a list of tags to advertise to the client.
It is an expensive computation to determine which tags are reachable
from the branches the client has READ permission on. For large
repositories with many tags the computation can take a considerable
amount of time, bordering on several seconds per connection. To make
the general case more efficient, introduce a cache called "git_tags".
On a trivial usage of the Linux kernel repository, the average
running time of the VisibleRefFilter when caches were hot was
7195.68 ms. With this commit, it is a mere 5.07 milliseconds
on a hot cache. A reduction of 99% of the running time.
The caches performs incremental updates under certain situations,
making common updates relatively painless for waiting clients:
* Branch fast-forward / change submission:
When a branch fast-forwards to a new commit, or a change is
submitted to a branch, all prior tags that are reachable from
that branch are still reachable. The cache updates itself
in-place to this new commit, after performing a fast-forward
check. Although a fast-forward check requires walking commit
history, most fast-forwards are only a few commits ahead of the
prior position, making the check fast enough to do on demand
for a client. Once the cache has been updated, other clients
will not need to perform the check.
* Short branch rewinds:
If a branch rewinds to a prior version (or cuts to a different
history), and in doing so does not eliminate any tags, the
cache is updated in-place in real-time. Since the change did
not impact any tags, the cache is still valid and can continue
to be reused.
* Branch deletion:
If a branch is deleted, the cache is not updated. The deleted
branch is simply ignored in the cache. If a great number of
branches are deleted and the cache is wasting memory, site
administrators should flush the "git_tags" cache and force a
rebuild. However, since a branch costs just 1 bit per tag, plus
the size of the branch name string, it would require deleting
75,000 branches before its even worth considering flushing the
cache manually... as 75,000 branches is about 5 MB of storage.
* Branch creation at another branch tip:
If a new branch is created and points to the same commit as
an existing branch, the cache is updated by cloning itself
and adding the new branch for all tags reachable from the
source branch. To keep things thread-safe in memory with
minimal locking, this type of update requires making a full
copy of the cache's data and is therefore more expensive than
the prior update techniques, but is significantly cheaper than
a full rebuild.
There are some nasty corner cases the cache does not try to
handle. For these we just suffer through a full rebuild:
* Branch creation not at another branch tip:
Since the newly created branch does not exactly match another
branch, the project history must be scanned and computed to
determine which tags are reachable from which branches. There
is no optimization available for creating a new branch at
an existing tag, so those cases also force a rebuild.
* New tag:
Since the tag position is not exactly known, which branches can
reach it is also not known. The project history is scanned to
rebuild the cache.
Unlike the prior version of VisibleRefFilter, a rebuild of the
cache only walks the history of the project once. This makes a
rebuild slightly faster (5s now vs. 7s before).
Cache updates occur automatically as a result of observed changes in
the Git repository references. Doing these updates live just before
sending the advertisement to a client ensure the cache's reachability
data accurately represents the advertisement the client will receive.
It also ensures any updates made directly to the repository (without
Gerrit being involved) is still properly reflected in the result.
We try to optimize updates for two common cases: change submission
and direct push fast-forwards. Both attempt to update the cache
immediately after the reference has been updated. This saves clients
from needing to perform the fast-forward check themselves, as the
change submission or receive code already performed the check and
has the results on hand.
There is still room for future improvement:
Adding a new tag to the cache could be computed more quickly by
copying the old cache, and doing a walk from all branch heads up
to the new tag, but stopping at the new tag. This does not always
work well when there are disconnected branches in the repository,
as those unrelated branches will scan to the roots before
terminating, making the update nearly the cost of full rebuild.
Adding a new branch not at the exact same position as another
branch could be computed more quickly by scanning all commits
in the new branch (to see if tags were added), and all commits
in existing branches (to see if tags should be excluded), and
stopping at the merge base of the two sets. Any tags that are not
excluded from the other branches but are reachable from the other
branches should also be reachable from the new branch. However
this is a tricky concept, as one must consider unrelated branches,
so this operation may need to be performed once per unique set
of branches that a tag can reach. Since this may require more
than one history traversal, and if the branches are unrelated,
a traversal all the way to the root, its nearly as expensive as
the full rebuild option. Since its a lot more complex than a full
rebuild, I am punting on this and may never implement it.
Change-Id: I519a39ade2742de02003d9dedd17a8edca5c4923
12 files changed