# Gerrit Maintenance
This package provides a set of tools that can be used to maintain a Gerrit site.
Some tools will also work with git repositories in general.
The following tools are available:
- [Extended Git GarbageCollection](#extended-git-garbagecollection)
## Dependencies
- Python > 3.12
For development, some additional Python libraries are required. These are managed
with pipenv. To install them, run:
```sh
pipenv sync --dev
```
## Development
### Code Style
This package is formatted using `black`. To automatically format all Python files,
run:
```sh
pipenv run black .
```
`flake8` is used to identify code style issues. To run it, use:
```sh
pipenv run flake8 .
```
### Tests
To execute tests, run:
```sh
pipenv run pytest
```
## Usage
The gerrit-maintenance CLI provides a toolbox of scripts for performing
maintenance tasks on a Gerrit site. The CLI uses a nested command structure. The
available commands are described in the following sections.
To start the CLI, run:
```sh
pipenv run python ./gerrit-maintenance.py -d $SITE -h
```
At this level, the path to the Gerrit site has to be provided.
The next layer deals with the different aspects of a Gerrit site:
### Projects
This set of subcommands deals with maintaining the projects/repositories in
the Gerrit site. To get an overview of available commands, run:
```sh
pipenv run python ./gerrit-maintenance.py -d $SITE projects -h
```
By default, the selected subcommand runs on all projects in the site. The
list can be filtered either by selecting specific projects:
```sh
pipenv run python ./gerrit-maintenance.py \
-d $SITE \
projects \
--project All-Users \
--project All-Projects \
$CMD
```
or by skipping some projects:
```sh
pipenv run python ./gerrit-maintenance.py \
-d $SITE \
projects \
--skip All-Users \
--skip All-Projects \
$CMD
```
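The filtering described above can be sketched as follows. This is a minimal illustration, not the CLI's actual code: the function name `select_projects` and its parameters are hypothetical, and how the CLI treats a project that is both selected and skipped is an assumption (here, skipping wins).

```python
def select_projects(all_projects, include=None, skip=None):
    """Return the projects a subcommand would run on.

    Hypothetical helper: `include` mirrors `--project`, `skip` mirrors `--skip`.
    """
    include = set(include or [])
    skip = set(skip or [])
    return [
        project
        for project in all_projects
        # No explicit selection means "all projects".
        if (not include or project in include) and project not in skip
    ]
```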
The maintenance scripts available for projects are:
#### Git Garbage Collection
To run Git GC as part of the gerrit-maintenance CLI, run:
```sh
pipenv run python ./gerrit-maintenance.py \
-d $SITE \
projects \
gc
```
You can also run it as a [standalone git extension](#extended-git-garbagecollection).
You can provide git configuration options to git gc using the `-c` option:
```sh
pipenv run python ./gerrit-maintenance.py \
-d $SITE \
projects \
gc \
-c repack.writebitmaps=false
```
As with the standalone git extension, all arguments that are not known to the CLI
are forwarded to the `git gc` command. For example, the following
command suppresses all progress reports logged by git:
```sh
pipenv run python ./gerrit-maintenance.py \
-d $SITE \
projects \
gc \
--quiet
```
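One common way to implement this kind of argument forwarding in Python is `argparse.ArgumentParser.parse_known_args`, which returns unrecognized arguments separately. The sketch below only illustrates the idea; the parser layout and the helper name `build_git_command` are assumptions, not the CLI's actual implementation.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that only knows the `-c` option.
    parser = argparse.ArgumentParser(prog="gc-sketch")
    parser.add_argument("-c", dest="config", action="append", default=[])
    return parser


def build_git_command(argv: list[str]) -> list[str]:
    """Build a `git gc` command line, forwarding unknown arguments."""
    known, forwarded = build_parser().parse_known_args(argv)
    cmd = ["git"]
    for option in known.config:
        # Configuration overrides go before the subcommand.
        cmd += ["-c", option]
    # Everything the parser did not recognize is handed to `git gc`.
    cmd += ["gc", *forwarded]
    return cmd
```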
The CLI also includes all extended features mentioned in [this section](#extended-features).
## Extended Git GarbageCollection
Git provides a GarbageCollection command (`git gc`) to clean up repositories.
Unfortunately, this command misses some cleanup steps that help improve
the performance of a repository.
The python script provided here wraps `git gc` and adds additional options and
cleanup steps.
### Dependencies
Refer to [general dependencies](#dependencies).
To keep running this tool simple, no non-standard libraries are used.
### Installation
Put this directory somewhere convenient and ensure that the `git-gcplus`
executable can be found via the `PATH` environment variable, e.g. by symlinking it
into `/usr/local/bin`.
### Usage
The extended git gc can be called like any other git command:
```sh
git gcplus
```
This will run the extended gc in the current working directory (if it is a
repository).
A specific repository can be set as usual using `-C`:
```sh
git -C "/var/gerrit/git/All-Users.git" gcplus
```
The repository configuration can also be overridden as usual:
```sh
git -c repack.writebitmaps=false gcplus
```
The script also forwards all [options](https://git-scm.com/docs/git-gc#_options)
provided by the `git gc` command to the included `git gc` run. For example, the following
command suppresses all progress reports written by git:
```sh
git gcplus --quiet
```
The extended git gc script also adds a few more options:
- `--pack-all-refs` / `-r`
### Extended features
#### Packing all refs
Enabled by: `--pack-all-refs` / `-r`
By default, `git gc` only packs refs that are already packed. That potentially
leaves a lot of loose refs in large projects, some of which are not actively
being used anymore.
Enabling this feature runs `git pack-refs --all` if there are more
than 10 loose refs after the `git gc` run.
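A minimal sketch of this check, assuming that loose refs are counted as the files below the repository's `refs` directory. The function names and the counting strategy are illustrative assumptions; the threshold of 10 comes from the description above.

```python
import subprocess
from pathlib import Path

LOOSE_REF_THRESHOLD = 10  # threshold taken from the description above


def count_loose_refs(git_dir: Path) -> int:
    """Count files below <git_dir>/refs (assumed to be the loose refs)."""
    refs_dir = git_dir / "refs"
    if not refs_dir.is_dir():
        return 0
    return sum(1 for path in refs_dir.rglob("*") if path.is_file())


def pack_refs_if_needed(git_dir: Path, threshold: int = LOOSE_REF_THRESHOLD) -> bool:
    """Run `git pack-refs --all` if more than `threshold` loose refs exist.

    Returns True if packing was triggered.
    """
    if count_loose_refs(git_dir) <= threshold:
        return False
    subprocess.run(
        ["git", "--git-dir", str(git_dir), "pack-refs", "--all"],
        check=True,
    )
    return True
```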
#### Preserving packs
Enabled by configuring `gc.preserveoldpacks = true`
As part of git gc, packs are rewritten, which also changes the pack names.
If a long-running request accesses a pack that is rewritten in this way
while the request is running, the request can fail, because the server tries
and fails to access the old pack, which has been deleted. This can lead to a significant
number of failing requests on large repositories and greatly inconvenience users.
JGit provides a feature that prevents this scenario by preserving packs.
The packs are hardlinked before the gc, and JGit falls back
to the preserved pack if a request fails to find a pack. Unfortunately, this
is not supported by native git.
This extended gc script adds support for the following options added by JGit:
- `gc.preserveoldpacks`: Whether to preserve packs before running `git gc`.
- `gc.prunepreserved`: Whether to prune preserved packs created by previous runs.
Setting those options prevents the failures described above if the server uses
JGit (e.g. Gerrit), at the cost of using more storage.
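The hardlinking step can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: the `preserved` subdirectory name, keeping the original file names, and the function names are all assumptions, and the naming JGit expects for preserved packs may differ.

```python
import os
from pathlib import Path


def prune_preserved(pack_dir: Path) -> None:
    """Delete packs preserved by earlier runs (cf. `gc.prunepreserved`)."""
    preserved = pack_dir / "preserved"  # assumed directory name
    if preserved.is_dir():
        for entry in preserved.iterdir():
            if entry.is_file():
                entry.unlink()


def preserve_packs(pack_dir: Path) -> list[Path]:
    """Hardlink current pack files before gc (cf. `gc.preserveoldpacks`).

    Returns the newly created hardlinks.
    """
    preserved = pack_dir / "preserved"  # assumed directory name
    preserved.mkdir(exist_ok=True)
    linked = []
    for entry in sorted(pack_dir.iterdir()):
        if not entry.is_file() or not entry.name.startswith("pack-"):
            continue
        target = preserved / entry.name
        if not target.exists():
            # A hardlink keeps the data reachable after gc deletes the original.
            os.link(entry, target)
            linked.append(target)
    return linked
```

Because hardlinks share the underlying data, the extra storage is only consumed once gc removes the original pack and the preserved link becomes the last reference to it.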
#### Lock handling
Enabled: Always
Git guards gc with a lock file, `gc.pid`, which it locks before starting execution.
The lock file contains the PID and hostname of the process holding the
lock. Git tries to kill the process holding that lock if the lock file
wasn't modified in the last 12 hours and the process was started from the same host.
This does not work when git gc runs in an ephemeral
environment like Kubernetes, where the host may be different on every run,
e.g. if git gc runs in a Kubernetes CronJob on a repository in a shared
filesystem.
The extended git gc always deletes the lock if it hasn't been modified for
at least 12 hours. This matches the behavior of JGit.
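A minimal sketch of such a stale-lock check, assuming the lock's age is derived from the file's modification time. The function name and the `now` parameter (included to make the check testable) are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional

LOCK_MAX_AGE_SECONDS = 12 * 60 * 60  # 12 hours, as described above


def remove_stale_gc_lock(
    git_dir: Path,
    max_age: float = LOCK_MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Delete <git_dir>/gc.pid if it was not modified for `max_age` seconds.

    The hostname stored in the lock file is deliberately ignored.
    Returns True if a stale lock was removed.
    """
    lock = git_dir / "gc.pid"
    now = time.time() if now is None else now
    try:
        age = now - lock.stat().st_mtime
    except FileNotFoundError:
        return False  # no lock present
    if age < max_age:
        return False  # lock is still considered valid
    lock.unlink(missing_ok=True)
    return True
```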
#### Deletion of empty ref directories
Enabled: Always
Git gc might leave empty directories behind after packing refs. This happens if all refs
in a namespace have been packed. This potentially leaves thousands of empty
directories, especially with Gerrit's NoteDB, which can cause significant performance
issues on slow filesystems like NFS.
The extended gc deletes empty ref directories older than 1 hour.
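The cleanup could look roughly like the sketch below. It assumes that a directory's age is taken from its modification time and that `refs` itself and its direct children (e.g. `refs/heads`) are always kept; both are assumptions made for this illustration, as is the function name.

```python
import os
import time
from pathlib import Path
from typing import Optional


def delete_empty_ref_dirs(
    git_dir: Path, min_age: float = 3600, now: Optional[float] = None
) -> list[Path]:
    """Remove empty directories below <git_dir>/refs older than `min_age` seconds."""
    refs_dir = git_dir / "refs"
    now = time.time() if now is None else now
    removed = []
    # Walk bottom-up so that children are handled before their parents.
    for root, _dirs, _files in os.walk(refs_dir, topdown=False):
        path = Path(root)
        if path == refs_dir or path.parent == refs_dir:
            continue  # assumption: keep refs/ and its top-level namespaces
        if now - path.stat().st_mtime < min_age:
            continue  # too recent, a ref might be about to be written here
        try:
            path.rmdir()  # only succeeds if the directory is empty
        except OSError:
            continue
        removed.append(path)
    return removed
```

Removing a child directory updates the parent's modification time, so in normal operation a parent that has just become empty is only removed by a later run.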
#### Deletion of stale incoming packs
Enabled: Always
If a git server crashes while still serving push requests, the temporary incoming
pack file is never cleaned up and unnecessarily clutters the repository.
The extended gc considers incoming packs that have not been modified for 1 day to be stale
and deletes them.
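A sketch of this cleanup under stated assumptions: the file name pattern `incoming_*.pack` and searching the whole `objects` directory are guesses made for illustration, and the function name is hypothetical. Only the 1-day threshold comes from the description above.

```python
import time
from pathlib import Path
from typing import Optional

STALE_AFTER_SECONDS = 24 * 60 * 60  # 1 day, as described above


def delete_stale_incoming_packs(
    objects_dir: Path,
    pattern: str = "incoming_*.pack",  # assumed naming of incoming packs
    max_age: float = STALE_AFTER_SECONDS,
    now: Optional[float] = None,
) -> list[Path]:
    """Delete incoming pack files not modified for at least `max_age` seconds."""
    now = time.time() if now is None else now
    removed = []
    for pack in sorted(objects_dir.rglob(pattern)):
        if not pack.is_file():
            continue
        if now - pack.stat().st_mtime < max_age:
            continue  # may belong to a push that is still in progress
        pack.unlink()
        removed.append(pack)
    return removed
```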
#### Using a marker file to enable aggressive gc
Enabled by creating a file named `gc-aggressive` or `gc-aggressive-once` in the
repository's `.git` directory.
In some use cases, an aggressive GC should be run for a while as part of a scheduled
git gc. In that case, it is not always convenient to change the calling script.
The extended gc will check for the existence of the following files:
- `gc-aggressive`
- `gc-aggressive-once`
In the latter case, the file will be deleted, effectively causing an aggressive
gc just once.
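The marker file check can be sketched as follows. The function name is hypothetical, and deleting `gc-aggressive-once` before the gc run (rather than after it) is an assumption made for this illustration.

```python
from pathlib import Path


def use_aggressive_gc(git_dir: Path) -> bool:
    """Decide whether gc should run aggressively, based on marker files."""
    once = git_dir / "gc-aggressive-once"
    if once.exists():
        # One-shot marker: consume it so that later runs are not aggressive.
        once.unlink()
        return True
    # Permanent marker: stays in place until someone removes it.
    return (git_dir / "gc-aggressive").exists()
```

A caller could then append `--aggressive` to the `git gc` arguments whenever this function returns `True`.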