= Gerrit Code Review - Repository Maintenance
== Description
Each project in Gerrit is stored in a bare Git repository. Gerrit uses
the JGit library to access (read and write to) these Git repositories.
As modifications are made to a project, Git repository maintenance will
be needed or performance will eventually suffer. When the Git command
line tool is used to operate on a Git repository, the tool itself runs
`git gc` every now and then to ensure that Git garbage collection is
performed. However, regular maintenance does not happen as a result of
normal Gerrit operations, so this is something that Gerrit
administrators need to plan for.
Gerrit has a built-in feature which allows it to run Git garbage
collection on repositories. This can be
link:config-gerrit.html#gc[configured] to run on a regular basis, and/or
it can be run manually with the link:cmd-gc.html[gerrit gc] SSH
command or with the link:rest-api-projects.html#run-gc[run-gc] REST API.
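
As an illustration of the manual REST route, the following minimal sketch
builds (but does not send) an authenticated `run-gc` request using only the
Python standard library. The server URL, project name, and credentials are
hypothetical placeholders, and the helper name `build_run_gc_request` is
invented for this example. The endpoint path and the `aggressive` field are
taken from the REST API documentation linked above and should be checked
against the Gerrit version in use.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_run_gc_request(base_url, project, user, http_password, aggressive=False):
    """Build (but do not send) an authenticated POST to the run-gc endpoint."""
    # Project names may contain slashes, which must be percent-encoded.
    encoded = urllib.parse.quote(project, safe="")
    # The "/a/" prefix selects authenticated access.
    url = f"{base_url.rstrip('/')}/a/projects/{encoded}/gc"
    body = json.dumps({"aggressive": aggressive}).encode("utf-8")
    token = base64.b64encode(f"{user}:{http_password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical server, project, and credentials; replace before use.
    req = build_run_gc_request(
        "https://gerrit.example.com", "tools/my-project", "admin", "secret"
    )
    print(req.get_method(), req.full_url)
    # To actually trigger the collection, send the request:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect
the URL and body before running garbage collection against a live server.
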
Some administrators will opt to run `git gc` or `jgit gc` outside of
Gerrit instead. There are many reasons this might be done. The main one
is likely that garbage collection can be very resource intensive when it
is run in Gerrit, whereas scheduling an external job to run Git garbage
collection allows administrators to finely tune the approach and
resource usage of this maintenance.
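
An external job of this kind could be sketched as follows. This is only an
illustration, not a recommended procedure: the site path `/var/gerrit/git`
is a hypothetical default, the function names are invented here, and the
sketch assumes that repositories are bare directories whose names end in
`.git` and which contain a `HEAD` file. It runs plain `git gc` on one
repository at a time so that resource usage stays bounded.

```python
import subprocess
import sys
from pathlib import Path


def find_bare_repos(site_git_dir):
    """Return directories under site_git_dir that look like bare repositories."""
    root = Path(site_git_dir)
    # Assumption: a repository is a "*.git" directory containing a HEAD file.
    # Note that rglob walks the whole tree, which can be slow on large sites.
    return sorted(
        p for p in root.rglob("*.git") if p.is_dir() and (p / "HEAD").is_file()
    )


def gc_command(repo):
    """Build the `git gc` invocation for a single repository."""
    return ["git", "-C", str(repo), "gc"]


def run_gc(site_git_dir):
    """Run `git gc` on each repository in turn; return those that failed."""
    failed = []
    for repo in find_bare_repos(site_git_dir):
        result = subprocess.run(gc_command(repo), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(repo)
    return failed


if __name__ == "__main__":
    # Hypothetical default location; pass the real path as the first argument.
    site = sys.argv[1] if len(sys.argv) > 1 else "/var/gerrit/git"
    for repo in run_gc(site):
        print(f"git gc failed for {repo}", file=sys.stderr)
```

A script like this can be scheduled with any job scheduler, which lets
administrators choose the time window and add their own throttling.
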
== Git Garbage Collection Impacts
Unlike a typical server database, access to Git repositories is not
marshalled through a single process or a set of intercommunicating
processes. Unfortunately, the design of the on-disk layout of a Git
repository does not allow for 100% race-free operations when accessed by
multiple actors concurrently. These design shortcomings are more likely
to impact the operations of busy repositories, since race conditions are
more likely to occur when there are more concurrent operations. Since
most Gerrit servers are expected to run without interruptions, Git
garbage collection likely needs to be run during normal operational hours.
When it runs, it adds to the concurrency of the overall accesses. Given
that many of the operations in garbage collection involve deleting files
and directories, it has a higher chance of impacting other ongoing
operations than most other operations do.
=== Interrupted Operations
When Git garbage collection deletes a file or directory that is
currently in use by an ongoing operation, it can cause that operation to
fail. These sorts of failures are often single-shot failures, i.e. the
operation will succeed if tried again. An example of such a failure is
when a pack file is deleted while Gerrit is sending an object in the
file over the network to a user performing a clone or fetch. Usually
pack files are only deleted when the referenced objects in them have
been repacked and thus copied to a new pack file. So performing the same
operation again after the failed fetch will likely send the same object
from the new pack instead of the deleted one, and the operation will
succeed.
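
Because such failures usually disappear on a second attempt, client-side
automation can wrap its Git operations in a small retry helper. The
following is a minimal sketch; the function name `retry`, its parameters,
and the choice of `OSError` as the retryable error are assumptions made for
illustration, not part of Gerrit or Git.

```python
import time


def retry(operation, attempts=3, delay_seconds=1.0, retry_on=(OSError,)):
    """Call operation(); on a listed exception, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the failure.
            time.sleep(delay_seconds)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch():
        # Stand-in for a fetch that fails once, as if a pack file vanished.
        calls["count"] += 1
        if calls["count"] == 1:
            raise OSError("pack file disappeared")
        return "fetched"

    print(retry(flaky_fetch, delay_seconds=0.0))
```

In practice, `operation` could be a function that runs a clone or fetch and
raises an exception when the command exits with an error.
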
=== Data Loss
It is possible for data loss to occur when Git garbage collection runs.
This is very rare, but it can happen when an object is believed to be
unreferenced while object repacking is running, and garbage collection
then deletes it. Even though an object may indeed be unreferenced when
object repacking begins and the reachability of all objects is
determined, it can become referenced by another concurrent operation
after this determination but before it gets deleted. When this happens,
a new reference can be created which points to a now missing object,
and this will result in data loss.
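
The ordering of events behind this race can be illustrated with a toy
model. This is a deliberately simplified sketch and not how Git or JGit are
implemented: objects are plain strings, each reference points directly at
one object, and the object names, reference names, and the function
`simulate_gc_race` are all invented for the example.

```python
def simulate_gc_race():
    """Toy model of the race: return refs left pointing at missing objects."""
    objects = {"a1", "b2", "c3"}        # objects present in the repository
    refs = {"refs/heads/master": "a1"}  # each ref points at one object

    # Step 1: gc determines reachability; "b2" and "c3" look unreferenced.
    unreferenced = objects - set(refs.values())

    # Step 2: a concurrent operation creates a ref to "b2", which still exists.
    refs["refs/heads/topic"] = "b2"

    # Step 3: gc deletes what it found in step 1, using its stale snapshot.
    objects -= unreferenced

    # Any ref whose target is gone now points at a missing object.
    return {name: target for name, target in refs.items() if target not in objects}


if __name__ == "__main__":
    print(simulate_gc_race())
```

The function returns the new `refs/heads/topic` reference, whose target
`b2` was deleted in step 3 even though it had become referenced in step 2.
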
GERRIT
------
Part of link:index.html[Gerrit Code Review]