| = Gerrit Code Review - Repository Maintenance |
| |
| == Description |
| |
| Each project in Gerrit is stored in a bare Git repository. Gerrit uses |
| the JGit library to read from and write to these Git repositories. |
| As modifications are made to a project, Git repository maintenance is |
| needed, or performance will eventually suffer. The Git command line |
| tool runs `git gc` on a repository every now and then as it operates on |
| it, which ensures that Git garbage collection is performed. However, |
| regular maintenance does not happen as a result of normal Gerrit |
| operations, so Gerrit administrators need to plan for it. |
| |
| Gerrit has a built-in feature that allows it to run Git garbage |
| collection on repositories. This can be |
| link:config-gerrit.html#gc[configured] to run on a regular basis, and it |
| can also be run manually with the link:cmd-gc.html[gerrit gc] SSH |
| command or with the link:rest-api-projects.html#run-gc[run-gc] REST API. |
| Some administrators opt to run `git gc` or `jgit gc` outside of |
| Gerrit instead. There are many reasons to do this, the main one |
| likely being that garbage collection can be very resource intensive |
| when run in Gerrit, whereas scheduling an external job to run it |
| allows administrators to finely tune the approach and resource usage of |
| this maintenance. |
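| |
| The two manual triggers mentioned above can be sketched as follows. This |
| is a minimal illustration, not a definitive client: the host name, the |
| helper function names, the `/a/projects/{name}/gc` path, the `aggressive` |
| input field and the SSH port 29418 are assumptions that should be checked |
| against the linked command and REST API documentation. The request is |
| only built, never sent, and authentication is left out. |
| |

```python
import json
import urllib.parse
import urllib.request


def build_run_gc_request(base_url, project, aggressive=False):
    """Build (but do not send) a POST request for the run-gc REST endpoint.

    The endpoint path and the `aggressive` field are assumptions; verify
    them against the REST API documentation of your Gerrit version.
    """
    # Project names may contain '/', so the name is percent-encoded.
    encoded = urllib.parse.quote(project, safe="")
    # The /a/ prefix is assumed to select authenticated access.
    url = f"{base_url.rstrip('/')}/a/projects/{encoded}/gc"
    body = json.dumps({"aggressive": aggressive}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json; charset=UTF-8"},
    )


def build_ssh_gc_command(host, project, port=29418):
    """Return the argv for running `gerrit gc` on one project over SSH.

    Port 29418 is an assumed default; use the port your server listens on.
    """
    return ["ssh", "-p", str(port), host, "gerrit", "gc", project]
```

| |
| The returned request could be passed to `urllib.request.urlopen` once |
| credentials are added, and the argv list to `subprocess.run`. |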
| |
| == Git Garbage Collection Impacts |
| |
| Unlike a typical server database, access to Git repositories is not |
| marshalled through a single process or a set of intercommunicating |
| processes. Unfortunately, the design of the on-disk layout of a Git |
| repository does not allow for 100% race-free operations when accessed by |
| multiple actors concurrently. These design shortcomings are more likely |
| to impact the operations of busy repositories, since race conditions are |
| more likely to occur when there are more concurrent operations. Since |
| most Gerrit servers are expected to run without interruptions, Git |
| garbage collection likely needs to be run during normal operational hours. |
| When it runs, it adds to the concurrency of the overall accesses. Given |
| that many of the operations in garbage collection involve deleting files |
| and directories, it has a higher chance of impacting other ongoing |
| operations than most other operations. |
| |
| === Interrupted Operations |
| |
| When Git garbage collection deletes a file or directory that is |
| currently in use by an ongoing operation, it can cause that operation to |
| fail. These sorts of failures are often single-shot failures, i.e. the |
| operation will succeed if tried again. An example of such a failure is |
| when a pack file is deleted while Gerrit is sending an object in the |
| file over the network to a user performing a clone or fetch. Usually |
| pack files are only deleted when the referenced objects in them have |
| been repacked and thus copied to a new pack file. So performing the same |
| operation again after the failure will likely send the same object from |
| the new pack instead of the deleted one, and the operation will succeed. |
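| |
| Because such failures usually succeed on a second attempt, a client-side |
| script can wrap the operation in a small retry loop. The sketch below |
| assumes a hypothetical `retry` helper and a hypothetical `FlakyFetch` |
| stand-in that fails once and then succeeds; neither is part of Gerrit or |
| Git. |
| |

```python
import time


def retry(operation, attempts=3, delay_seconds=0.0, retry_on=(Exception,)):
    """Call operation() until it succeeds or the attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_seconds)


class FlakyFetch:
    """Hypothetical stand-in for a fetch that fails once, then succeeds."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls == 1:
            # Simulates a pack file being deleted while it is being sent.
            raise ConnectionError("pack file deleted during transfer")
        return "fetched"
```

| |
| In practice the operation could be a function that runs `git fetch` via |
| `subprocess.run(..., check=True)`, with `retry_on` narrowed to |
| `subprocess.CalledProcessError`. |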
| |
| === Data Loss |
| |
| It is possible for data loss to occur when Git garbage collection runs. |
| This is very rare, but it can happen when an object is believed to be |
| unreferenced while object repacking is running, and garbage collection |
| then deletes it. Even though an object may indeed be unreferenced when |
| object repacking begins and the reachability of all objects is |
| determined, it can become referenced by another concurrent operation |
| after this determination but before it gets deleted. When this happens, |
| a new reference can be created that points to a now-missing object, |
| and this results in data loss. |
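| |
| The ordering that leads to this loss can be replayed on a toy object |
| store. The sketch below is a deliberately simplified model, not how Git or |
| JGit are implemented: object IDs, ref names and the `simulate_gc_race` |
| function are hypothetical, and reachability is reduced to "a ref points |
| directly at the object". |
| |

```python
def simulate_gc_race(ref_created_during_gc):
    """Replay the repack/delete race on a toy object store.

    Returns the names of refs left pointing at a missing object.
    """
    objects = {"a1": "commit on main", "b2": "unreferenced commit"}
    refs = {"refs/heads/main": "a1"}

    # Step 1: garbage collection determines reachability from current refs.
    reachable = set(refs.values())
    to_delete = set(objects) - reachable

    # Step 2: a concurrent operation creates a ref to an object that
    # step 1 has already judged to be unreferenced.
    if ref_created_during_gc:
        refs["refs/heads/topic"] = "b2"

    # Step 3: garbage collection deletes based on its stale determination.
    for object_id in to_delete:
        del objects[object_id]

    return {name for name, target in refs.items() if target not in objects}
```

| |
| With `ref_created_during_gc=True` the new ref ends up dangling; with |
| `False` the deletion is harmless because nothing references the object. |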
| |
| GERRIT |
| ------ |
| Part of link:index.html[Gerrit Code Review] |