= Gerrit Code Review - Repository Maintenance

== Description

Each project in Gerrit is stored in a bare Git repository. Gerrit uses
the JGit library to access (read and write to) these Git repositories.
As modifications are made to a project, Git repository maintenance will
be needed or performance will eventually suffer. When the Git command
line tool operates on a Git repository, it runs `git gc` every now and
then on the repository to ensure that Git garbage collection is
performed. However, regular maintenance does not happen as a result of
normal Gerrit operations, so this is something that Gerrit
administrators need to plan for.

Gerrit has a built-in feature that runs Git garbage collection on
repositories. It can be link:config-gerrit.html#gc[configured] to run on
a regular basis, run manually with the link:cmd-gc.html[gerrit gc] SSH
command, or triggered with the
link:rest-api-projects.html#run-gc[run-gc] REST API. Some administrators
opt to run `git gc` or `jgit gc` outside of Gerrit instead. There are
many reasons to do so; the main one is likely that garbage collection
run inside Gerrit can be very resource intensive, whereas scheduling an
external job allows administrators to finely tune the approach and
resource usage of this maintenance.
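
As a minimal sketch of the manual REST route, the following Python builds
a `run-gc` request without sending it. The endpoint path
(`/a/projects/{name}/gc`), the `show_progress` input field, and the use of
HTTP basic authentication are assumptions drawn from the linked REST API
documentation and should be checked against your Gerrit version; the host,
project name, and credentials are hypothetical placeholders.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_run_gc_request(base_url, project, user, http_password,
                         show_progress=False):
    """Build (but do not send) a POST request for the run-gc endpoint."""
    # Project names may contain '/', which must be URL-encoded in the path.
    encoded = urllib.parse.quote(project, safe="")
    url = f"{base_url.rstrip('/')}/a/projects/{encoded}/gc"
    body = json.dumps({"show_progress": show_progress}).encode("utf-8")
    # Assumption: the server accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{http_password}".encode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "Authorization": "Basic " + token.decode("ascii"),
        },
    )


def run_gc(request, timeout=600):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to
inspect the URL and body before pointing the script at a production
server.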

== Git Garbage Collection Impacts

Unlike a typical server database, access to Git repositories is not
marshalled through a single process or a set of intercommunicating
processes. Unfortunately, the design of the on-disk layout of a Git
repository does not allow for 100% race-free operations when accessed by
multiple actors concurrently. These design shortcomings are more likely
to impact the operations of busy repositories, since racy conditions are
more likely to occur when there are more concurrent operations. Since
most Gerrit servers are expected to run without interruptions, Git
garbage collection likely needs to be run during normal operational hours.
When it runs, it adds to the concurrency of the overall accesses. Given
that many of the operations in garbage collection involve deleting files
and directories, it is more likely than most operations to interfere
with other ongoing operations.

=== Interrupted Operations

When Git garbage collection deletes a file or directory that is
currently in use by an ongoing operation, it can cause that operation to
fail. These sorts of failures are often single shot failures, i.e. the
operation will succeed if tried again. An example of such a failure is
when a pack file is deleted while Gerrit is sending an object in the
file over the network to a user performing a clone or fetch. Usually
pack files are only deleted when the referenced objects in them have
been repacked and thus copied to a new pack file. So performing the same
operation again after the failure will likely send the same object from
the new pack instead of the deleted one, and the operation will succeed.
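
Because such failures are usually single shot, a client-side retry is
often enough to hide them. The sketch below is not part of Gerrit; the
helper names, the attempt count, and the delay are illustrative
assumptions.

```python
import subprocess
import time


def retry(operation, attempts=3, delay_seconds=1.0):
    """Call operation() until it succeeds or attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # Give up: the failure is probably not transient.
            time.sleep(delay_seconds)


def fetch(repo_dir, remote="origin"):
    """Run `git fetch` and raise CalledProcessError on a non-zero exit."""
    subprocess.run(["git", "fetch", remote], cwd=repo_dir, check=True)
```

A caller would wrap the operation, for example
`retry(lambda: fetch("/path/to/clone"))`, where the path is a placeholder.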

=== Data Loss

It is possible, although very rare, for data loss to occur when Git
garbage collection runs. It happens when an object is believed to be
unreferenced while object repacking is running, and garbage collection
then deletes it. Even though the object may indeed be unreferenced when
object repacking begins and the reachability of all objects is
determined, it can become referenced by another concurrent operation
after this determination but before it gets deleted. When this happens,
a new reference can be created which points to a now missing object, and
this results in data loss.
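
The ordering that leads to this loss can be made concrete with a toy
model. The sketch below does not use Git at all: the object names, the ref
names, and the set-and-dictionary "repository" are purely illustrative,
and reachability is reduced to "some ref points at the object".

```python
def unreferenced(objects, refs):
    """Return the objects that no ref points to (toy reachability check)."""
    return set(objects) - set(refs.values())


# Toy repository: two stored objects, only one of them referenced.
objects = {"obj-a", "obj-b"}
refs = {"refs/heads/master": "obj-a"}

# Step 1: garbage collection decides which objects are unreferenced.
to_delete = unreferenced(objects, refs)

# Step 2: a concurrent operation creates a ref to an object that already
# exists, so it has no reason to write the object again.
refs["refs/changes/01/1/1"] = "obj-b"

# Step 3: garbage collection deletes based on its now stale determination.
objects -= to_delete

# Any ref listed here points at an object that no longer exists.
missing = {name for name, obj in refs.items() if obj not in objects}
print(missing)
```

In this model the ref created in step 2 ends up in `missing`, which is the
data loss described above.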

== Reducing Git Garbage Collection Impacts

JGit has a `preserved` directory feature which is intended to reduce
some of the impacts of Git garbage collection, and Gerrit can take
advantage of the feature too. The `preserved` directory is a
subdirectory of a repository's `objects/pack` directory where JGit will
move pack files that it would normally delete when `jgit gc` is invoked
with the `--preserve-oldpacks` option. It will delete these files the
next time that `jgit gc` is run, if it is invoked with the
`--prune-preserved` option. Using these flags together on every `jgit gc`
invocation means that pack files will have their lifetime extended by
one full garbage collection cycle. Since an atomic move is used to move
these files, any open references to them will continue to work, even on
NFS. On a busy repository, preserving pack files can make operations
much more reliable, and interrupted operations should almost entirely
disappear.
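
A scheduled external job that follows this advice could be sketched as
below. Only the two `jgit gc` options come from the text above; that a
`jgit` executable is on the `PATH` and locates the repository from the
working directory is an assumption, and the function names are
hypothetical.

```python
import subprocess


def jgit_gc_command(preserve_oldpacks=True, prune_preserved=True):
    """Return the jgit gc argument list with the preservation options."""
    command = ["jgit", "gc"]
    if preserve_oldpacks:
        command.append("--preserve-oldpacks")
    if prune_preserved:
        command.append("--prune-preserved")
    return command


def gc_repositories(repo_dirs):
    """Run jgit gc in each repository, one after another."""
    for repo_dir in repo_dirs:
        # Assumption: jgit picks up the repository from the working directory.
        subprocess.run(jgit_gc_command(), cwd=repo_dir, check=True)
```

Running the repositories sequentially is one simple way to bound the
resource usage of the job; other scheduling strategies are equally valid.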

Moving files to the `preserved` directory can also reduce data loss. If
JGit cannot find an object it needs in its current object DB, it will
look into the `preserved` directory as a last resort. If it finds the
object in a pack file there, it will restore the slated-to-be-deleted
pack file back to the original `objects/pack` directory, effectively
"undeleting" it and making all the objects in it available again. When
this happens, data loss is prevented.

One advantage of restoring preserved pack files in this way is that
loosening unreferenced objects during Git garbage collection, which is a
potentially expensive, wasteful, and performance-impacting operation, is
no longer desirable. If you use Git for garbage collection, it is
recommended that you use the `-a` option to `git repack` instead of the
`-A` option so that this loosening is no longer performed.

When Git is used for garbage collection instead of JGit, it is fairly
easy to wrap `git gc` or `git repack` with a small script. The script
can provide a `--prune-preserved` option, which behaves as described
above by deleting any pack files currently in the `preserved` directory,
and a `--preserve-oldpacks` option, which hardlinks all the currently
existing pack files from the `objects/pack` directory into the
`preserved` directory right before calling the real Git command. This
approach will then behave similarly to `jgit gc` with respect to
preserving pack files.
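
The wrapper described above might look like the following sketch. The
function names are hypothetical, and two details are assumptions rather
than documented behavior: only files named `pack-*` are hardlinked, and
they keep their original names in the `preserved` directory. JGit may
expect a different naming scheme for preserved files, so verify this
before relying on JGit to restore packs preserved this way.

```python
import os
import subprocess


def preserved_dir(repo_dir):
    """Return the path of the preserved directory of a bare repository."""
    return os.path.join(repo_dir, "objects", "pack", "preserved")


def prune_preserved(repo_dir):
    """Delete every file currently held in the preserved directory."""
    target = preserved_dir(repo_dir)
    if not os.path.isdir(target):
        return
    for name in os.listdir(target):
        os.remove(os.path.join(target, name))


def preserve_oldpacks(repo_dir):
    """Hardlink the current pack files into the preserved directory."""
    pack_dir = os.path.join(repo_dir, "objects", "pack")
    target = preserved_dir(repo_dir)
    os.makedirs(target, exist_ok=True)
    for name in os.listdir(pack_dir):
        source = os.path.join(pack_dir, name)
        # Assumption: only pack-* files are preserved, under their own names.
        if not (name.startswith("pack-") and os.path.isfile(source)):
            continue
        link = os.path.join(target, name)
        if not os.path.exists(link):
            os.link(source, link)


def gc_with_preserve(repo_dir, git_args=("gc",)):
    """Prune, preserve, then call the real Git command."""
    prune_preserved(repo_dir)
    preserve_oldpacks(repo_dir)
    subprocess.run(["git", *git_args], cwd=repo_dir, check=True)
```

For example, `gc_with_preserve("/path/to/project.git", ("repack", "-a", "-d"))`
would prune and preserve before repacking; the path is a placeholder, and
the arguments after `repack` are one possible choice.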

GERRIT
------
Part of link:index.html[Gerrit Code Review]

SEARCHBOX
---------