= Gerrit Code Review - Repository Maintenance
== Description
Each project in Gerrit is stored in a bare Git repository. Gerrit uses
the JGit library to access (read and write to) these Git repositories.
As modifications are made to a project, its Git repository needs
periodic maintenance, or performance will eventually suffer. The Git
command line tool runs `git gc` automatically every now and then on the
repositories it operates on, which ensures that Git garbage collection
is performed. However, this regular maintenance does not happen as a
result of normal Gerrit operations, so Gerrit administrators need to
plan for it.
Gerrit has a built-in feature which allows it to run Git garbage
collection on repositories. This can be
link:config-gerrit.html#gc[configured] to run on a regular basis, and/or
it can be run manually with the link:cmd-gc.html[gerrit gc] SSH
command or with the link:rest-api-projects.html#run-gc[run-gc] REST API.
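
As a rough illustration of the REST route, the following sketch builds (but does not send) a request for the run-gc endpoint. It assumes the endpoint is reachable as `POST /a/projects/{project-name}/gc`, that an empty JSON body is accepted, and that authentication is added separately. The host name, project name, and function names are hypothetical.

```python
import urllib.parse
import urllib.request


def run_gc_url(base_url: str, project: str) -> str:
    """Build the URL of the run-gc endpoint for one project (assumed path)."""
    # Project names may contain '/', which must be percent-encoded in the path.
    encoded = urllib.parse.quote(project, safe="")
    # The '/a/' prefix is assumed here to select authenticated access.
    return f"{base_url.rstrip('/')}/a/projects/{encoded}/gc"


def build_run_gc_request(base_url: str, project: str) -> urllib.request.Request:
    """Prepare, but do not send, a POST request that asks for garbage collection."""
    return urllib.request.Request(
        run_gc_url(base_url, project),
        data=b"{}",  # assumption: an empty options object is acceptable
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )


# Hypothetical server and project, used only to show the resulting URL.
request = build_run_gc_request("https://gerrit.example.com/", "plugins/replication")
```

Sending the request would additionally need credentials, for example through a `urllib.request` opener configured for HTTP authentication.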
Some administrators will opt to run `git gc` or `jgit gc` outside of
Gerrit instead. There are many reasons to do this. The main one is
likely that garbage collection can be very resource intensive when run
inside Gerrit, whereas scheduling an external job allows administrators
to finely tune the approach and resource usage of this maintenance.
== Git Garbage Collection Impacts
Unlike a typical server database, access to Git repositories is not
marshalled through a single process or a set of intercommunicating
processes. Unfortunately, the on-disk layout of a Git repository does
not allow for 100% race-free operations when it is accessed by multiple
actors concurrently. These design shortcomings are more likely to
affect busy repositories, since race conditions become more likely as
the number of concurrent operations grows. Since most Gerrit servers are
expected to run without interruptions, Git garbage collection likely
needs to run during normal operational hours. When it runs, it adds to
the overall concurrency of repository accesses. Because many garbage
collection steps delete files and directories, it has a higher chance of
impacting other ongoing operations than most other operations do.
=== Interrupted Operations
When Git garbage collection deletes a file or directory that is
currently in use by an ongoing operation, it can cause that operation to
fail. These failures are often one-off failures, i.e. the operation will
succeed if tried again. An example of such a failure is when a pack file
is deleted while Gerrit is sending an object from that file over the
network to a user performing a clone or fetch. Usually, pack files are
only deleted after the referenced objects in them have been repacked,
and thus copied, into a new pack file. So performing the same operation
again after the failure will likely send the same object from the new
pack instead of the deleted one, and the operation will succeed.
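
Because such failures usually disappear on a second attempt, a client-side script can simply retry. The following is a minimal sketch of that idea, not something Gerrit provides. The helper names, the number of attempts, and the delay are assumptions chosen for illustration.

```python
import subprocess
import time


def retry(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # give up and report the last failure
            sleep(delay_seconds)  # brief pause before trying again


def fetch(repo_dir, remote="origin"):
    """Run `git fetch`, raising CalledProcessError on a non-zero exit code."""
    subprocess.run(["git", "-C", repo_dir, "fetch", remote], check=True)
```

A caller might use it as `retry(lambda: fetch("/path/to/clone"))`, where the path is a placeholder.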
=== Data Loss
It is possible, although very rare, for data loss to occur when Git
garbage collection runs. It can happen when an object is believed to be
unreferenced while object repacking is running, and garbage collection
then deletes it. Even though an object may indeed be unreferenced when
repacking begins and the reachability of all objects is determined, it
can become referenced by another concurrent operation after this
determination but before it gets deleted. When this happens, a new
reference can be created which points to a now missing object, and this
will result in a loss of data.
== Reducing Git Garbage Collection Impacts
JGit has a `preserved` directory feature which is intended to reduce
some of the impacts of Git garbage collection, and Gerrit can take
advantage of the feature too. The `preserved` directory is a
subdirectory of a repository's `objects/pack` directory where JGit will
move pack files that it would normally delete when `jgit gc` is invoked
with the `--preserve-oldpacks` option. It will later delete these files
the next time that `jgit gc` is run if it is invoked with the
`--prune-preserved` option. Using these flags together on every `jgit gc`
invocation means that the lifetime of pack files is extended by one
full garbage collection cycle. Since an atomic move is used to move these
files, any open references to them will continue to work, even on NFS. On
a busy repository, preserving pack files can make operations much more
reliable, and interrupted operations should almost entirely disappear.
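
The reason open references survive the move can be seen in a small experiment. The sketch below assumes a POSIX system where source and target are on the same filesystem (on other platforms renaming an open file may fail). The file name and the helper function are made up for the demonstration and have nothing to do with JGit's own code.

```python
import os
import tempfile


def read_after_preserve(pack_dir: str, name: str) -> bytes:
    """Open a file, move it into 'preserved' while it is open, then read it."""
    preserved = os.path.join(pack_dir, "preserved")
    os.makedirs(preserved, exist_ok=True)
    source = os.path.join(pack_dir, name)
    with open(source, "rb") as handle:
        # A rename within one filesystem replaces the directory entry in one step.
        os.rename(source, os.path.join(preserved, name))
        # The handle refers to the file itself, not to its old path,
        # so the read below still returns the original content.
        return handle.read()


with tempfile.TemporaryDirectory() as pack_dir:
    with open(os.path.join(pack_dir, "pack-demo.pack"), "wb") as f:
        f.write(b"pack data")
    data = read_after_preserve(pack_dir, "pack-demo.pack")
```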
Moving files to the `preserved` directory can also reduce data loss. If
JGit cannot find an object it needs in its current object DB, it will
look in the `preserved` directory as a last resort. If it finds the
object in a pack file there, it will restore the slated-to-be-deleted
pack file to the original `objects/pack` directory, effectively
"undeleting" it and making all the objects in it available again. When
this happens, data loss is prevented.
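
The restore-on-lookup behavior can be pictured with a toy model. This is not JGit's implementation: here a "pack" is just a text file listing object IDs, the file names are invented, and real pack and index formats (as well as whatever naming JGit uses inside `preserved`) are ignored entirely.

```python
import os
import tempfile


def first_pack_containing(directory, object_id):
    """Return the first toy pack file in directory that lists object_id."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:
                if object_id in handle.read().split():
                    return path
    return None


def find_object(pack_dir, object_id):
    """Find the pack holding object_id, restoring it from 'preserved' if needed."""
    path = first_pack_containing(pack_dir, object_id)
    if path is not None:
        return path
    preserved_dir = os.path.join(pack_dir, "preserved")
    if os.path.isdir(preserved_dir):
        path = first_pack_containing(preserved_dir, object_id)
        if path is not None:
            # Last resort succeeded: move the whole pack back ("undelete" it).
            restored = os.path.join(pack_dir, os.path.basename(path))
            os.rename(path, restored)
            return restored
    return None


with tempfile.TemporaryDirectory() as pack_dir:
    os.makedirs(os.path.join(pack_dir, "preserved"))
    with open(os.path.join(pack_dir, "pack-new"), "w") as f:
        f.write("ccc\n")
    with open(os.path.join(pack_dir, "preserved", "pack-old"), "w") as f:
        f.write("aaa\nbbb\n")
    restored_name = os.path.basename(find_object(pack_dir, "bbb"))
    still_preserved = os.listdir(os.path.join(pack_dir, "preserved"))
    sibling_name = os.path.basename(find_object(pack_dir, "aaa"))
    missing = find_object(pack_dir, "zzz")
```

In this model, looking up `bbb` moves `pack-old` back next to `pack-new`, which also makes `aaa` available again without touching `preserved` a second time.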
One advantage of restoring preserved pack files in this way is that
loosening unreferenced objects during Git garbage collection, a
potentially expensive, wasteful, and performance-impacting operation,
is no longer desirable. If you use Git for garbage collection, it is
recommended that you pass the `-a` option to `git repack` instead of the
`-A` option, so that this loosening is no longer performed.
When Git is used for garbage collection instead of JGit, it is fairly
easy to wrap `git gc` or `git repack` with a small script that offers
the same two options. Its `--prune-preserved` option behaves as
described above by deleting any pack files currently in the `preserved`
directory. Its `--preserve-oldpacks` option then hard-links all the
currently existing pack files from the `objects/pack` directory into the
`preserved` directory right before calling the real Git command. This
approach then behaves similarly to `jgit gc` with respect to preserving
pack files.
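
Such a wrapper might look roughly like the following sketch. It is an illustration under several assumptions: the repository is bare, links keep the original file names (which may differ from the names JGit expects inside `preserved`), and the `--repack` switch and all function names are invented here rather than taken from an existing tool.

```python
import argparse
import os
import subprocess


def preserved_dir(repo):
    """Path of the preserved directory inside a bare repository."""
    return os.path.join(repo, "objects", "pack", "preserved")


def prune_preserved(repo):
    """Delete every file currently in the preserved directory."""
    directory = preserved_dir(repo)
    if not os.path.isdir(directory):
        return
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            os.remove(path)


def preserve_oldpacks(repo):
    """Hard-link all files in objects/pack into the preserved directory."""
    pack_dir = os.path.join(repo, "objects", "pack")
    directory = preserved_dir(repo)
    os.makedirs(directory, exist_ok=True)
    for name in os.listdir(pack_dir):
        source = os.path.join(pack_dir, name)
        target = os.path.join(directory, name)
        # Skip the preserved directory itself and files that are already linked.
        if os.path.isfile(source) and not os.path.exists(target):
            os.link(source, target)


def main(argv=None, run=subprocess.run):
    parser = argparse.ArgumentParser(description="Preserve old packs around Git GC")
    parser.add_argument("repo", help="path to a bare repository")
    parser.add_argument("--prune-preserved", action="store_true")
    parser.add_argument("--preserve-oldpacks", action="store_true")
    parser.add_argument("--repack", action="store_true",
                        help="run `git repack -a -d` instead of `git gc`")
    args = parser.parse_args(argv)
    if args.prune_preserved:
        prune_preserved(args.repo)
    if args.preserve_oldpacks:
        preserve_oldpacks(args.repo)
    git_args = ["repack", "-a", "-d"] if args.repack else ["gc"]
    # The real Git command runs last, after the current packs have been linked.
    run(["git", "-C", args.repo] + git_args, check=True)
```

Calling `main()` from a script entry point and passing both options on every run gives each pack file one extra garbage collection cycle before it is finally removed.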
Part of link:index.html[Gerrit Code Review]