| commit | bc3bd13d3ee8d85884e36bcaa79c183bde536449 |
|---|---|
| author | Jacek Centkowski &lt;geminica.programs@gmail.com&gt; Fri Dec 15 10:52:14 2023 +0100 |
| committer | Jacek Centkowski &lt;geminica.programs@gmail.com&gt; Thu Dec 21 09:42:41 2023 +0100 |
| tree | 91f13f9b11d210bec7cdf8b84c416c741d2baa08 |
| parent | 9c01319e981f542d6f3de87afd0a67303080f8cf |
Implement changes indexes lock files check

Detect the following failures of the write.lock files of both the open and closed changes indexes:

* the write.lock file is missing
* the write.lock file is not writable by Gerrit

Notes:

* so far only the changes indexes are subject to the check
* in order to test it, the `UseLocalDisk` annotation has to be applied (a heavy operation), therefore `HealthCheckIT` was modified so that the `changesindex` healthcheck is disabled for all test cases (or additionally disabled when some checks get disabled)
* `changesindex` healthcheck test cases were added to a dedicated `ChangesIndexHealthCheckIT` file
* functions common to `HealthCheckIT` and `ChangesIndexHealthCheckIT` were moved into `AbstractHealthCheckIntegrationTest`

The write.lock ownership change detection was tested manually with the following steps:

* `docker run --rm -p 8080:8080 -p 29418:29418 -ti --name gerrit_3.9.1 gerritcodereview/gerrit:3.9.1`
* `docker cp bazel-bin/plugins/healthcheck/healthcheck.jar gerrit_3.9.1:/var/gerrit/plugins`
* `curl localhost:8080/config/server/healthcheck~status` returned `"changesindex": { "result": "passed", "ts": 1703147883563, "elapsed": 0 }`
* `docker exec -u root -it gerrit_3.9.1 /bin/bash`
* `chown root:root /var/gerrit/index/changes_0084/open/write.lock`
* `curl localhost:8080/config/server/healthcheck~status` returned `"changesindex": { "result": "failed", "ts": 1703148022276, "elapsed": 0 }`

IOW it works as expected.

Bug: Issue 40015289
Change-Id: I66a4053483e2c10eb28c150782c33cfd10dc2b15
Allow having a single entry point to check the availability of the services that Gerrit exposes.
Clone or link this plugin to the plugins directory of Gerrit's source tree, and then run bazel build in the plugin's directory.
Example:
```
git clone --recursive https://gerrit.googlesource.com/gerrit
git clone https://gerrit.googlesource.com/plugins/healthcheck
pushd gerrit/plugins && ln -s ../../healthcheck . && popd
cd gerrit && bazel build plugins/healthcheck
```
The output plugin jar is created in:
```
bazel-genfiles/plugins/healthcheck/healthcheck.jar
```
Copy healthcheck.jar into Gerrit's `/plugins` directory and wait for the plugin to be loaded automatically.

The healthcheck plugin is compatible with both Gerrit primary setups and Gerrit replicas. The only difference to bear in mind is that some checks (e.g. query changes) are automatically disabled on replicas, because the associated subsystem is switched off there.
The healthcheck plugin exposes a single endpoint under its root URL and provides a JSON output of the Gerrit health status.
The HTTP status code returned indicates whether Gerrit is healthy (HTTP status 200) or has some issues (HTTP status 500).
The HTTP response payload is a JSON output that contains the details of the checks performed.
Each check returns a JSON payload with the following information:
* `ts`: epoch timestamp in millis of the individual check
* `elapsed`: elapsed time in millis to complete the check
* `result`: result of the health check
Example of a healthy Gerrit response:
```
GET /config/server/healthcheck~status

200 OK
Content-Type: application/json

)]}'
{
  "ts": 139402910202,
  "elapsed": 100,
  "querychanges": {
    "ts": 139402910202,
    "elapsed": 20,
    "result": "passed"
  },
  "reviewdb": {
    "ts": 139402910202,
    "elapsed": 50,
    "result": "passed"
  },
  "projectslist": {
    "ts": 139402910202,
    "elapsed": 100,
    "result": "passed"
  },
  "jgit": {
    "ts": 139402910202,
    "elapsed": 80,
    "result": "passed"
  }
}
```
Example of a Gerrit instance with the projects list timing out:
```
GET /config/server/healthcheck~status

500 ERROR
Content-Type: application/json

)]}'
{
  "ts": 139402910202,
  "elapsed": 100,
  "querychanges": {
    "ts": 139402910202,
    "elapsed": 20,
    "result": "passed"
  },
  "reviewdb": {
    "ts": 139402910202,
    "elapsed": 50,
    "result": "passed"
  },
  "projectslist": {
    "ts": 139402910202,
    "elapsed": 100,
    "result": "timeout"
  },
  "jgit": {
    "ts": 139402910202,
    "elapsed": 80,
    "result": "passed"
  }
}
```
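A client consuming these responses must strip the `)]}'` prefix that Gerrit prepends to JSON payloads (to prevent cross-site script inclusion) before parsing. A minimal sketch in Python — the function names are illustrative, not part of the plugin:

```python
import json

XSSI_PREFIX = ")]}'"

def parse_healthcheck(body):
    # Gerrit prepends )]}' to JSON responses; strip it before parsing.
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)

def failed_checks(payload):
    # Names of sub-checks whose result is anything but "passed"
    # (e.g. "failed" or "timeout"); skips top-level "ts"/"elapsed".
    return sorted(name for name, value in payload.items()
                  if isinstance(value, dict)
                  and value.get("result") != "passed")

body = ")]}'\n" + json.dumps({
    "ts": 139402910202,
    "elapsed": 100,
    "projectslist": {"ts": 139402910202, "elapsed": 100, "result": "timeout"},
    "jgit": {"ts": 139402910202, "elapsed": 80, "result": "passed"},
})
print(failed_checks(parse_healthcheck(body)))  # ['projectslist']
```

In practice the HTTP status code (200 vs 500) is enough for a load balancer; inspecting the payload like this is only needed when you want to know *which* sub-check failed.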
It's also possible to make the healthcheck fail artificially by placing a file at a configurable path, specified like:

```
[healthcheck]
  failFileFlagPath = "data/healthcheck/fail"
```
This will make the healthcheck endpoint return 500 even if the node is otherwise healthy. This is useful when a node needs to be removed from the pool of available Gerrit instances while it undergoes maintenance.
NOTE: If the path starts with `/`, paths outside of Gerrit's home are also checked. If the path does not start with `/`, it is resolved relative to Gerrit's home.
NOTE: The file needs to be a real file rather than a symlink.
As for all other endpoints in Gerrit, some metrics are automatically emitted when the `/config/server/healthcheck~status` endpoint is hit (thanks to the Dropwizard library).
Some additional metrics are also produced to give extra insight into the results and latency of the individual healthcheck sub-components, such as jgit, reviewdb, etc.
More information can be found in the metrics.md file.