= Gerrit Code Review - Repository Maintenance

== Description

Each project in Gerrit is stored in a bare Git repository. Gerrit uses
the JGit library to read from and write to these Git repositories.
As modifications are made to a project, Git repository maintenance is
needed or performance will eventually suffer. When the Git command line
tool operates on a Git repository, it runs `git gc` every now and then
to ensure that Git garbage collection is performed. However, regular
maintenance does not happen as a result of normal Gerrit operations, so
this is something that Gerrit administrators need to plan for.

Gerrit has a built-in feature which allows it to run Git garbage
collection on repositories. This can be
link:config-gerrit.html#gc[configured] to run on a regular basis, and/or
it can be run manually with the link:cmd-gc.html[gerrit gc] ssh
command, or with the link:rest-api-projects.html#run-gc[run-gc] REST API.
Some administrators will opt to run `git gc` or `jgit gc` outside of
Gerrit instead. There are many reasons this might be done, the main one
likely being that garbage collection can be very resource intensive when
it is run in Gerrit, and scheduling an external job to run it allows
administrators to finely tune the approach and resource usage of this
maintenance.
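
As a rough illustration of such an external job, the following sketch walks a directory of bare repositories and runs `git gc` on each one in turn. The function names and the assumption that every repository directory name ends in `.git` are illustrative only; they are not part of Gerrit.

```python
import subprocess
from pathlib import Path


def find_bare_repos(base_dir):
    """Return directories under base_dir whose names end in .git, sorted."""
    base = Path(base_dir)
    return sorted(p for p in base.rglob("*.git") if p.is_dir())


def gc_command(repo):
    """Build the `git gc` command line for one bare repository."""
    return ["git", f"--git-dir={repo}", "gc"]


def run_gc(base_dir):
    """Run `git gc` on each repository, one at a time."""
    for repo in find_bare_repos(base_dir):
        # check=False: a failure in one repository should not stop the rest.
        subprocess.run(gc_command(repo), check=False)
```

Running the repositories sequentially is one simple way to bound resource usage; a scheduler such as cron could call `run_gc` with the site's repository directory during quieter hours.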

== Git Garbage Collection Impacts

Unlike a typical server database, access to Git repositories is not
marshalled through a single process or a set of intercommunicating
processes. Unfortunately, the design of the on-disk layout of a Git
repository does not allow for 100% race-free operations when accessed by
multiple actors concurrently. These design shortcomings are more likely
to impact the operations of busy repositories, since racy conditions are
more likely to occur when there are more concurrent operations. Since
most Gerrit servers are expected to run without interruptions, Git
garbage collection likely needs to be run during normal operational hours.
When it runs, it adds to the concurrency of the overall accesses. Given
that many of the operations in garbage collection involve deleting files
and directories, it has a higher chance of impacting other ongoing
operations than most other operations do.

=== Interrupted Operations

When Git garbage collection deletes a file or directory that is
currently in use by an ongoing operation, it can cause that operation to
fail. These sorts of failures are often one-off failures, i.e. the
operation will succeed if tried again. An example of such a failure is
when a pack file is deleted while Gerrit is sending an object in the
file over the network to a user performing a clone or fetch. Usually
pack files are only deleted when the referenced objects in them have
been repacked and thus copied to a new pack file. So performing the same
operation again after the failure will likely send the same object from
the new pack instead of the deleted one, and the operation will succeed.
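
Because these failures usually succeed on a second attempt, a client-side script can wrap the operation in a small retry loop. The helper below is a minimal sketch; the name `retry` and its parameters are illustrative and not part of Gerrit or Git.

```python
import time


def retry(operation, attempts=3, delay_seconds=1.0, retry_on=(OSError,)):
    """Call operation() until it succeeds or attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on as err:
            last_error = err
            # Pause before the next try, but not after the final one.
            if attempt + 1 < attempts:
                time.sleep(delay_seconds)
    raise last_error
```

For example, a fetch run through `subprocess.run(..., check=True)` could be passed as `operation` with `retry_on=(subprocess.CalledProcessError,)`.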

=== Data Loss

It is possible for data loss to occur when Git garbage collection runs.
This is very rare, but it can happen when an object is believed to be
unreferenced while object repacking is running, and garbage collection
then deletes it. Even though an object may indeed be unreferenced when
object repacking begins and the reachability of all objects is
determined, it can become referenced by another concurrent operation
after this determination but before it gets deleted. When this happens,
a new reference can be created which points to a now missing object,
and this results in data loss.
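
The ordering of events behind this race can be shown with a toy in-memory model. This is only an illustration of the sequence described above; the object names, ref names, and the `unreferenced` helper are made up, and real reachability is determined by walking the object graph rather than by comparing sets.

```python
def unreferenced(objects, refs):
    """Objects that no ref points at (a stand-in for a reachability walk)."""
    return set(objects) - set(refs.values())


# Toy state: two objects, only "aaa" is referenced.
objects = {"aaa", "bbb"}
refs = {"refs/heads/main": "aaa"}

# 1. Garbage collection determines which objects are unreferenced.
to_delete = unreferenced(objects, refs)

# 2. A concurrent operation creates a ref to "bbb" after that determination.
refs["refs/heads/topic"] = "bbb"

# 3. Garbage collection deletes based on its now stale determination.
objects -= to_delete

# Any ref whose target no longer exists represents lost data.
missing = {name: oid for name, oid in refs.items() if oid not in objects}
```

After step 3, `missing` holds the new ref, which points at an object that has been deleted.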

== Reducing Git Garbage Collection Impacts

JGit has a `preserved` directory feature which is intended to reduce
some of the impacts of Git garbage collection, and Gerrit can take
advantage of the feature too. The `preserved` directory is a
subdirectory of a repository's `objects/pack` directory where JGit will
move pack files that it would normally delete when `jgit gc` is invoked
with the `--preserve-oldpacks` option. It will delete these files
the next time that `jgit gc` is run, if it is invoked with the
`--prune-preserved` option. Using these flags together on every `jgit gc`
invocation means that the lifetime of pack files is extended by one
full garbage collection cycle. Since an atomic move is used to move these
files, any open references to them will continue to work, even on NFS. On
a busy repository, preserving pack files can make operations much more
reliable, and interrupted operations should almost entirely disappear.
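
A scheduled job might build and run that command line as in the sketch below. It assumes that a `jgit` launcher is available on `PATH` and that it picks up the repository from the working directory; both are assumptions about the local setup, and the function names are illustrative.

```python
import subprocess


def jgit_gc_command(preserve_oldpacks=True, prune_preserved=True):
    """Build a `jgit gc` command line using the two flags described above."""
    command = ["jgit", "gc"]
    if preserve_oldpacks:
        command.append("--preserve-oldpacks")
    if prune_preserved:
        command.append("--prune-preserved")
    return command


def run_jgit_gc(repo_dir):
    """Run `jgit gc` with both flags inside repo_dir."""
    # Assumption: `jgit` is on PATH and discovers the repository from cwd.
    return subprocess.run(jgit_gc_command(), cwd=repo_dir, check=True)
```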

Moving files to the `preserved` directory can also reduce data loss.
If JGit cannot find an object it needs in its current object
DB, it will look into the `preserved` directory as a last resort. If it
finds the object in a pack file there, it will restore the
slated-to-be-deleted pack file back to the original `objects/pack`
directory, effectively "undeleting" it and making all the objects in it
available again. When this happens, data loss is prevented.
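
The recovery step can be sketched as follows. This is not JGit's implementation: `pack_contains` is a hypothetical caller-supplied predicate standing in for a pack index lookup, and the sketch assumes that preserved files keep their original names, which may differ from what JGit actually does.

```python
import os
from pathlib import Path


def restore_preserved_pack(pack_dir, object_id, pack_contains):
    """Move the first preserved pack holding object_id back into pack_dir.

    Returns the restored pack path, or None if no preserved pack matches.
    """
    pack_dir = Path(pack_dir)
    preserved = pack_dir / "preserved"
    if not preserved.is_dir():
        return None
    for pack in sorted(preserved.glob("*.pack")):
        if not pack_contains(pack, object_id):
            continue
        companions = preserved.glob(pack.stem + ".*")
        # Move the index last so an index never appears without its pack.
        for path in sorted(companions, key=lambda p: p.suffix == ".idx"):
            os.replace(path, pack_dir / path.name)
        return pack_dir / pack.name
    return None
```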

One advantage of restoring preserved pack files in this way when an
object in them is referenced is that loosening unreferenced objects
during Git garbage collection, which is a potentially expensive,
wasteful, and performance-impacting operation, is no longer desirable.
If you use Git for garbage collection, it is recommended that you use
the `-a` option to `git repack` instead of the `-A` option so that this
loosening is no longer performed.

When Git is used for garbage collection instead of JGit, it is fairly
easy to wrap `git gc` or `git repack` with a small script. The script
has a `--prune-preserved` option, which behaves as described above by
deleting any pack files currently in the `preserved` directory, and a
`--preserve-oldpacks` option, which hardlinks all the currently
existing pack files from the `objects/pack` directory into the
`preserved` directory right before calling the real Git command. This
approach will then behave similarly to `jgit gc` with respect to
preserving pack files.
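
A minimal sketch of such a wrapper is shown below. The function names are illustrative, the choice of `git repack -a -d` as the wrapped command is an assumption, and treating every file named `pack-*` as a pack file is a simplification.

```python
import os
import subprocess
from pathlib import Path


def prune_preserved(pack_dir):
    """Delete files left in the preserved directory by the previous cycle."""
    preserved = Path(pack_dir) / "preserved"
    if preserved.is_dir():
        for entry in preserved.iterdir():
            if entry.is_file():
                entry.unlink()


def preserve_oldpacks(pack_dir):
    """Hardlink the current pack files into the preserved directory."""
    pack_dir = Path(pack_dir)
    preserved = pack_dir / "preserved"
    preserved.mkdir(exist_ok=True)
    for entry in pack_dir.glob("pack-*"):
        target = preserved / entry.name
        if entry.is_file() and not target.exists():
            os.link(entry, target)


def gc_with_preserve(git_dir):
    """Prune, preserve, then call the real Git command."""
    pack_dir = Path(git_dir) / "objects" / "pack"
    prune_preserved(pack_dir)
    preserve_oldpacks(pack_dir)
    # Assumption: `git repack -a -d` is the command being wrapped.
    subprocess.run(["git", f"--git-dir={git_dir}", "repack", "-a", "-d"],
                   check=True)
```

Because the preserved copies are hardlinks, they take no extra disk space until Git deletes the originals, at which point they keep the old pack data alive for one more cycle.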

GERRIT
------
Part of link:index.html[Gerrit Code Review]

SEARCHBOX
---------