| = Gerrit Code Review - Repository Maintenance |
| |
| == Description |
| |
| Each project in Gerrit is stored in a bare Git repository. Gerrit uses |
| the JGit library to access (read and write to) these Git repositories. |
| As modifications are made to a project, Git repository maintenance will |
| be needed or performance will eventually suffer. When using the Git |
| command line tool to operate on a Git repository, it will run `git gc` |
| every now and then on the repository to ensure that Git garbage |
| collection is performed. However regular maintenance does not happen as |
| a result of normal Gerrit operations, so this is something that Gerrit |
| administrators need to plan for. |
| |
| Gerrit has a built-in feature which allows it to run Git garbage |
| collection on repositories. This can be |
| link:config-gerrit.html#gc[configured] to run on a regular basis, and/or |
| this can be run manually with the link:cmd-gc.html[gerrit gc] ssh |
| command, or with the link:rest-api-projects.html#run-gc[run-gc] REST API. |
| Some administrators will opt to run `git gc` or `jgit gc` outside of |
| Gerrit instead. There are many reasons this might be done, the main one |
| likely being that when it is run in Gerrit it can be very resource |
| intensive and scheduling an external job to run Git garbage collection |
| allows administrators to finely tune the approach and resource usage of |
| this maintenance. |
| |
| == Git Garbage Collection Impacts |
| |
| Unlike a typical server database, access to Git repositories is not |
| marshalled through a single process or a set of inter communicating |
| processes. Unfortunately the design of the on-disk layout of a Git |
| repository does not allow for 100% race free operations when accessed by |
| multiple actors concurrently. These design shortcomings are more likely |
| to impact the operations of busy repositories since racy conditions are |
| more likely to occur when there are more concurrent operations. Since |
| most Gerrit servers are expected to run without interruptions, Git |
| garbage collection likely needs to be run during normal operational hours. |
| When it runs, it adds to the concurrency of the overall accesses. Given |
| that many of the operations in garbage collection involve deleting files |
| and directories, it has a higher chance of impacting other ongoing |
| operations than most other operations. |
| |
| === Interrupted Operations |
| |
| When Git garbage collection deletes a file or directory that is |
| currently in use by an ongoing operation, it can cause that operation to |
| fail. These sorts of failures are often single shot failures, i.e. the |
| operation will succeed if tried again. An example of such a failure is |
| when a pack file is deleted while Gerrit is sending an object in the |
| file over the network to a user performing a clone or fetch. Usually |
| pack files are only deleted when the referenced objects in them have |
| been repacked and thus copied to a new pack file. So performing the same |
| operation again after the fetch will likely send the same object from |
| the new pack instead of the deleted one, and the operation will succeed. |
| |
| === Data Loss |
| |
| It is possible for data loss to occur when Git garbage collection runs. |
| This is very rare, but it can happen. This can happen when an object is |
| believed to be unreferenced when object repacking is running, and then |
| garbage collection deletes it. This can happen because even though an |
| object may indeed be unreferenced when object repacking begins and |
| reachability of all objects is determined, it can become referenced by |
| another concurrent operation after this unreferenced determination but |
| before it gets deleted. When this happens, a new reference can be |
| created which points to a now missing object, and this will result in a |
| loss. |
| |
| == Reducing Git Garbage Collection Impacts |
| |
| JGit has a `preserved` directory feature which is intended to reduce |
| some of the impacts of Git garbage collection, and Gerrit can take |
| advantage of the feature too. The `preserved` directory is a |
| subdirectory of a repository's `objects/pack` directory where JGit will |
| move pack files that it would normally delete when `jgit gc` is invoked |
| with the `--preserve-oldpacks` option. It will later delete these files |
| the next time that `jgit gc` is run if it is invoked with the |
| `--prune-preserved` option. Using these flags together on every `jgit gc` |
| invocation means that packfiles will get an extended lifetime by one |
| full garbage collection cycle. Since an atomic move is used to move these |
| files, any open references to them will continue to work, even on NFS. On |
| a busy repository, preserving pack files can make operations much more |
| reliable, and interrupted operations should almost entirely disappear. |
| |
| Moving files to the `preserved` directory also has the ability to reduce |
| data loss. If JGit cannot find an object it needs in its current object |
| DB, it will look into the `preserved` directory as a last resort. If it |
| finds the object in a pack file there, it will restore the |
| slated-to-be-deleted pack file back to the original `objects/pack` |
| directory effectively "undeleting" it and making all the objects in it |
| available again. When this happens, data loss is prevented. |
| |
| One advantage of restoring preserved packfiles in this way when an |
| object is referenced in them, is that it makes loosening unreferenced |
| objects during Git garbage collection, which is a potentially expensive, |
| wasteful, and performance impacting operation, no longer desirable. It |
| is recommended that if you use Git for garbage collection, that you use |
| the `-a` option to `git repack` instead of the `-A` option to no longer |
| perform this loosening. |
| |
| When Git is used for garbage collection instead of JGit, it is fairly |
| easy to wrap `git gc` or `git repack` with a small script which has a |
| `--prune-preserved` option which behaves as mentioned above by deleting |
| any pack files currently in the preserved directory, and also has a |
| `--preserve-oldpacks` option which then hardlinks all the currently |
| existing pack files from the `objects/pack` directory into the |
| `preserved` directory right before calling the real Git command. This |
| approach will then behave similarly to `jgit gc` with respect to |
| preserving pack files. An implementation is available |
| link:https://gerrit.googlesource.com/gerrit/+/refs/heads/master/contrib/git-gc-preserve[here]. |
| |
| GERRIT |
| ------ |
| Part of link:index.html[Gerrit Code Review] |
| |
| SEARCHBOX |
| --------- |