Stop creating HealthCheckConfig with a dynamically injected plugin name

Currently, in order to create a `HealthCheckConfig` instance one needs
to pass the plugin name, which is injected dynamically when the object
is provisioned. The constructor then creates a config object through
the plugin config factory, using the provided plugin name.
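
For context, the pattern described above looks roughly like this (a
simplified sketch, not the exact code in this repository; the class
and method names follow Gerrit's plugin API):

  import com.google.gerrit.extensions.annotations.PluginName;
  import com.google.gerrit.server.config.PluginConfigFactory;
  import com.google.inject.Inject;
  import org.eclipse.jgit.lib.Config;

  class HealthCheckConfig {
    private final Config config;

    @Inject
    HealthCheckConfig(PluginConfigFactory configFactory,
        @PluginName String pluginName) {
      // The injected name decides which file is read: for a core check
      // this resolves to healthcheck.config, but for an external
      // plugin "foo" it resolves to foo.config.
      this.config = configFactory.getGlobalPluginConfig(pluginName);
    }
  }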

This in turn means the config object will hold the contents of the
`<pluginName>.config` file. For any "core" healthcheck, i.e. a
healthcheck defined in the healthcheck plugin itself, this works fine:
configuration is retrieved from the `healthcheck.config` file.

The problem arises for any external check, i.e. a check defined in an
external plugin. In that case, the `pluginName` will be that of the
external plugin, so for a plugin foo the `HealthCheckConfig` object
will use the configuration located in `foo.config`. The external
plugin's entire config is thus leaked into the healthcheck config
object.

This is an even bigger problem because it assumes every plugin
provides its config through a `<pluginName>.config` file, which is not
always true. A prime example is the pull-replication plugin, whose
config is defined in a `replication.config` file. In such cases, any
configuration defined in the plugin's config file will be silently
ignored.

A major impact of the above problems is that for external checks, we
can't have the "base" healthcheck config in the `healthcheck.config`
file. By "base" here we refer mainly to the `enabled` flag and the
timeout, both core features of any healthcheck specification.
Such configuration must be defined in the external plugin's config file.
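
For illustration, this is the kind of entry one would expect to be
able to put in healthcheck.config for a check contributed by an
external plugin, but which is currently not picked up (the check name,
key names and values below are only indicative):

  [healthcheck "someExternalCheck"]
    enabled = false
    timeout = 5000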

Stop injecting the plugin name dynamically and instead hardcode it to
the healthcheck plugin's name. An additional benefit of this approach
is that `HealthCheckConfig` will truly be a singleton object, as both
core and external healthchecks will use the same instance.
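
Under the same simplifications as the sketch above, the change amounts
to something like:

  @Inject
  HealthCheckConfig(PluginConfigFactory configFactory) {
    // Always read healthcheck.config, regardless of which plugin
    // provisioned this object.
    this.config = configFactory.getGlobalPluginConfig("healthcheck");
  }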

Bug: Issue 312895374
Change-Id: I445ceafb69c74bc60530f25b44aa80e09262c2a7

README.md

Plugin to verify the Gerrit health status

Allow having a single entry point to check the availability of the services that Gerrit exposes.

How to build

Clone or link this plugin to the plugins directory of Gerrit's source tree, and then run bazel build on the plugin's directory.

Example:

git clone --recursive https://gerrit.googlesource.com/gerrit
git clone https://gerrit.googlesource.com/plugins/healthcheck
pushd gerrit/plugins && ln -s ../../healthcheck . && popd
cd gerrit && bazel build plugins/healthcheck

The output plugin jar is created in:

bazel-genfiles/plugins/healthcheck/healthcheck.jar

How to install

Copy the healthcheck.jar into Gerrit's /plugins directory and wait for the plugin to be automatically loaded. The healthcheck plugin is compatible with both primary Gerrit setups and Gerrit replicas. The only difference to bear in mind is that some checks will be automatically disabled on replicas (e.g. query changes) because the associated subsystem is switched off.
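
For example, assuming the jar built above and a Gerrit site directory pointed to by $GERRIT_SITE (both paths are indicative for your setup):

cp gerrit/bazel-genfiles/plugins/healthcheck/healthcheck.jar "$GERRIT_SITE/plugins/"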

How to use

The healthcheck plugin exposes a single endpoint under its root URL and provides a JSON output of the Gerrit health status.

The HTTP status code returned indicates whether Gerrit is healthy (HTTP status 200) or has some issues (HTTP status 500).
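
For example, the returned status code can be checked with curl (the host below is a placeholder for your Gerrit server; whether authentication is required depends on your setup):

curl -s -o /dev/null -w '%{http_code}\n' \
  http://gerrit.example.com/config/server/healthcheck~status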

The HTTP response payload is a JSON output that contains the details of the checks performed.

  • ts: epoch timestamp in millis of the test
  • elapsed: elapsed time in millis to perform the check
  • querychanges: check that Gerrit can query changes
  • reviewdb: check that Gerrit can connect and query ReviewDb
  • projectslist: check that Gerrit can list projects
  • jgit: check that Gerrit can access repositories

Each check returns a JSON payload with the following information:

  • ts: epoch timestamp in millis of the individual check

  • elapsed: elapsed time in millis to complete the check

  • result: result of the health check

    • passed: the check passed successfully
    • disabled: the check was disabled
    • failed: the check failed with an error
    • timeout: the check took too long and timed out

Example of a healthy Gerrit response:

GET /config/server/healthcheck~status

200 OK
Content-Type: application/json

)]}'
{
  "ts": 139402910202,
  "elapsed": 100,
  "querychanges": {
    "ts": 139402910202,
    "elapsed": 20,
    "result": "passed"
  },
  "reviewdb": {
    "ts": 139402910202,
    "elapsed": 50,
    "result": "passed"
  },
  "projectslist": {
    "ts": 139402910202,
    "elapsed": 100,
    "result": "passed"
  },
  "jgit": {
    "ts": 139402910202,
    "elapsed": 80,
    "result": "passed"
  }
}

Example of a Gerrit instance with the projects list timing out:

GET /config/server/healthcheck~status

500 ERROR
Content-Type: application/json

)]}'
{
  "ts": 139402910202,
  "elapsed": 100,
  "querychanges": {
    "ts": 139402910202,
    "elapsed": 20,
    "result": "passed"
  },
  "reviewdb": {
    "ts": 139402910202,
    "elapsed": 50,
    "result": "passed"
  },
  "projectslist": {
    "ts": 139402910202,
    "elapsed": 100,
    "result": "timeout"
  },
  "jgit": {
    "ts": 139402910202,
    "elapsed": 80,
    "result": "passed"
  }
}

It's also possible to artificially make the healthcheck fail by placing a file at a configurable path specified like:

[healthcheck]
  failFileFlagPath="data/healthcheck/fail"

This will make the healthcheck endpoint return 500 even if the node is otherwise healthy. This is useful when a node needs to be removed from the pool of available Gerrit instances while it undergoes maintenance.

NOTE: If the path starts with / then even paths outside of Gerrit's home will be checked. If the path does not start with /, it is resolved relative to Gerrit's home.

NOTE: The file needs to be a real file rather than a symlink.
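
For example, with the configuration above and a Gerrit home pointed to by $GERRIT_SITE (an indicative path), a node can be drained and restored like this:

# Start reporting the node as unhealthy
mkdir -p "$GERRIT_SITE/data/healthcheck"
touch "$GERRIT_SITE/data/healthcheck/fail"

# Resume normal health checks
rm "$GERRIT_SITE/data/healthcheck/fail"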

Metrics

As with all other endpoints in Gerrit, some metrics are automatically emitted when the /config/server/healthcheck~status endpoint is hit (thanks to the Dropwizard library).

Some additional metrics are also produced to give extra insight into the results and latency of the individual healthcheck subcomponents, such as jgit, reviewdb, etc.

More information can be found in the metrics.md file.