# Roadmap
## General
### Planned features
- **Automated verification process**: Run tests automatically to verify changes. \
\
Most tests in the project require a Kubernetes cluster and some additional
prerequisites, e.g. Istio. Currently, the Gerrit open-source community does not
have these resources. At SAP, we plan to run verification on our internal systems,
whose results won't be publicly viewable, but which could already vote. Builds would
only be triggered if a maintainer votes `+1` on the `Build-Approved` label. \
\
Builds can be moved to a public CI at a later point in time.
- **Automated publishing of container images**: Publishing container images will
happen automatically on `ref-updated` events using a CI system.
- **Support for multiple Gerrit versions**: All currently supported Gerrit versions
will also be supported in k8s-gerrit. \
\
Currently, container images used by this project are only published for a single
Gerrit version, which is updated on an irregular schedule. Introducing stable
branches for each Gerrit version will make it possible to maintain container images
for multiple Gerrit versions. Gerrit binaries will be updated with each official
release and more frequently on `master`. This will be (at least partially)
automated.
- **Integration test suite**: A test suite that can be used to test a GerritCluster. \
\
A GerritCluster running in a Kubernetes cluster consists of multiple components.
Having a suite of automated tests would greatly help to verify deployments in
development landscapes before going into production.
## Gerrit Operator
### Version 1.0
#### Implemented features
- **High-availability**: Primary Gerrit StatefulSets have limited support for
horizontal scaling. \
\
Scaling has been enabled using the [high-availability plugin](https://gerrit.googlesource.com/plugins/high-availability/).
Primary Gerrits run in Active/Active configuration. Currently, two primary
Gerrit instances, i.e. 2 pods in a StatefulSet, are supported (see the sketch below).
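
  A minimal sketch of what scaling a primary Gerrit could look like in a
  CustomResource; the `apiVersion`, kind and field names below are assumptions for
  illustration, not necessarily the operator's verbatim API:

  ```yaml
  # Hypothetical CustomResource sketch -- field names are assumptions,
  # not necessarily the operator's verbatim API.
  apiVersion: gerritoperator.google.com/v1beta1
  kind: Gerrit
  metadata:
    name: gerrit-primary
  spec:
    mode: PRIMARY
    replicas: 2  # two Active/Active primary pods in one StatefulSet
  ```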
- **Global RefDB support**: Global RefDB is required for Active/Active configurations
of multiple primary Gerrits. \
\
The [Global RefDB](https://gerrit.googlesource.com/modules/global-refdb) support
is required for high-availability as described in the previous point. The
Gerrit Operator automatically sets up Gerrit to use a Global RefDB
implementation. The following implementations are supported:
- [spanner-refdb](https://gerrit.googlesource.com/plugins/spanner-refdb)
- [zookeeper-refdb](https://gerrit.googlesource.com/plugins/zookeeper-refdb)
\
The Gerrit Operator does not set up the database used by the Global RefDB. It
does, however, manage plugin/module installation and configuration in Gerrit (see
the sketch below).
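
  A hedged sketch of how a ZooKeeper-backed Global RefDB might be declared in a
  GerritCluster CustomResource; the field names are illustrative assumptions, not
  necessarily the operator's verbatim API:

  ```yaml
  # Hypothetical sketch -- field names are assumptions for illustration.
  apiVersion: gerritoperator.google.com/v1beta1
  kind: GerritCluster
  metadata:
    name: gerrit
  spec:
    refdb:
      database: ZOOKEEPER
      zookeeper:
        connectString: zookeeper.zookeeper.svc.cluster.local:2181
  ```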
- **Full support for Nginx**: The integration of Ingresses managed by the Nginx
ingress controller now supports automated routing. \
\
Instead of requiring users to use different subdomains for the different Gerrit
deployments in the GerritCluster, requests are now automatically routed to the
respective deployments. SSH still has to be set up manually, since this requires
configuring the routing in the Nginx ingress controller itself.
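
  Conceptually, the generated HTTP routing resembles a path-based Ingress like the
  following sketch, which sends read traffic (`git-upload-pack`) to the replicas and
  everything else to the primary; service names and paths are placeholders, since
  the operator generates the actual resource:

  ```yaml
  # Conceptual sketch of path-based routing -- names are placeholders.
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: gerrit-ingress
    annotations:
      nginx.ingress.kubernetes.io/use-regex: "true"
  spec:
    ingressClassName: nginx
    rules:
      - host: gerrit.example.com
        http:
          paths:
            - path: /.*/git-upload-pack  # fetches can be served by replicas
              pathType: ImplementationSpecific
              backend:
                service:
                  name: gerrit-replica
                  port:
                    number: 8080
            - path: /  # everything else goes to the primary
              pathType: Prefix
              backend:
                service:
                  name: gerrit-primary
                  port:
                    number: 8080
  ```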
#### Planned features
- **Versioning of CRDs**: Provide migration paths between API changes in CRDs. \
\
At the moment, updates to the CRDs are done without providing a migration path.
This means a complete reinstallation of the CRDs, operator, CRs and dependent
resources is required, which is not acceptable in a production environment. Thus,
the operator will always support the last two versions of each CRD, where applicable,
and provide a migration path between those versions.
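
  For reference, Kubernetes can serve two CRD versions side by side and convert
  between them with a conversion webhook, roughly like this sketch (names and the
  webhook path are placeholders; schemas are reduced to a permissive stub):

  ```yaml
  # Generic Kubernetes CRD versioning sketch -- names are placeholders.
  apiVersion: apiextensions.k8s.io/v1
  kind: CustomResourceDefinition
  metadata:
    name: gerrits.gerritoperator.google.com
  spec:
    group: gerritoperator.google.com
    names:
      kind: Gerrit
      plural: gerrits
    scope: Namespaced
    versions:
      - name: v1beta1
        served: true  # old version stays readable during migration
        storage: false
        schema:
          openAPIV3Schema:
            type: object
            x-kubernetes-preserve-unknown-fields: true
      - name: v1beta2
        served: true
        storage: true  # new version is used for persistence
        schema:
          openAPIV3Schema:
            type: object
            x-kubernetes-preserve-unknown-fields: true
    conversion:
      strategy: Webhook
      webhook:
        conversionReviewVersions: ["v1"]
        clientConfig:
          service:
            name: gerrit-operator
            namespace: gerrit-operator
            path: /convert
  ```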
- **Log collection**: Support adding a sidecar that runs a log collection agent
to ship the logs of all components to a logging stack (see the sketch after this
list). \
\
Planned supported log collectors:
- [OpenTelemetry agent](https://opentelemetry.io/docs/collector/deployment/agent/)
- Option to add a custom sidecar
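
  As a rough illustration, such a sidecar could look like the following snippet in
  the Gerrit pod template; the image tag, volume and config names are placeholders:

  ```yaml
  # Illustrative sidecar snippet for a pod template -- names are placeholders.
  containers:
    - name: otel-agent
      image: otel/opentelemetry-collector-contrib:latest
      args: ["--config=/etc/otel/config.yaml"]
      volumeMounts:
        - name: gerrit-logs  # volume shared with the Gerrit container
          mountPath: /var/gerrit/logs
          readOnly: true
        - name: otel-config  # ConfigMap with receivers/exporters for the logging stack
          mountPath: /etc/otel
  ```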
- **Support for additional Ingress controllers**: Add support for setting up routing
configurations for additional Ingress controllers. \
\
Additional ingress controllers might include:
- [Ambassador](https://www.getambassador.io/products/edge-stack/api-gateway)
### Version 1.x
#### Potential features
- **Support for additional log collection agents**: \
\
Additional log collection agents might include:
- [Fluent Bit](https://fluentbit.io/)
- Option to add a custom sidecar
- **Additional ValidationWebhooks**: Proactively avoid unsupported configurations. \
\
ValidationWebhooks are already used to reject unsupported configurations,
e.g. deploying more than one primary Gerrit CustomResource per GerritCluster.
So far, not all such cases are covered; thus, the set of validations will be
further expanded.
- **Better test coverage**: More tests are required to find bugs earlier.
- **Automated reload of plugins**: Reload plugins on configuration change. \
\
Configuration changes in plugins typically don't require a restart of Gerrit,
but just a reload of the plugin. To avoid unnecessary downtime of pods, the
Gerrit Operator will only reload the affected plugins instead of restarting all
pods if only a plugin's configuration changed.
- **Externalized (re-)indexing**: Alleviate load caused by online reindexing. \
\
On large Gerrit sites, online reindexing due to schema migrations `a)` or the
initialization of a new site `b)` might take weeks and use a lot of resources,
which might cause performance issues. This is not acceptable in production. The
current plan to solve this issue is to implement a separate Gerrit deployment
(GerritIndexer) that is not exposed to clients and that takes over the task of
online reindexing. The GerritIndexer will mount the same repositories and will
share events via the high-availability plugin. However, it will access
repositories in read-only mode. \
This solves the above-named scenarios as follows: \
\
a) **Schema migrations**: If a Gerrit update including a schema migration for
an index is applied, the Gerrit instances serving clients will be configured
to continue to use the old schema. Online reindexing will be disabled in
those instances. The GerritIndexer will have online reindexing enabled and
will start to build the new index version. As soon as it is finished, i.e.
it could start to use the new index version as the read index, it will make a
copy of the new index and publish it, e.g. using a shared filesystem. A
restart of the Gerrit instances serving clients will then be triggered.
During this restart the new index will be copied into the site. Since index
entries may have been updated after the new index version was published,
reindexing of the entries changed in the meantime will be triggered. \
\
b) **Initialization of a new site**: If Gerrit is horizontally scaled, the new
instance starts with an empty index, i.e. it would have to build the complete
index from scratch. To avoid this, the GerritIndexer deployment will continuously
keep a copy of the indexes up-to-date. It will regularly be stopped so that a
copy of the index can be stored in a shared volume. This copy can be used as a
base for new instances, which then only have to update index entries that were
changed in the meantime.
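
  A loudly hypothetical sketch of what such a deployment might look like as a
  CustomResource; neither the kind nor any of the fields exist yet:

  ```yaml
  # Purely hypothetical -- the GerritIndexer CRD does not exist yet.
  apiVersion: gerritoperator.google.com/v1beta1
  kind: GerritIndexer
  metadata:
    name: gerrit-indexer
  spec:
    cluster: gerrit  # GerritCluster whose repositories are mounted read-only
    indexCopy:
      schedule: "0 * * * *"  # how often to snapshot the index to the shared volume
      destination: shared/indexes
  ```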
- **Autoscaling**: Automatically scale Gerrit deployments based on usage. \
\
Metrics like the number of available workers in the thread pools could be used
to decide when to scale the Gerrit deployment horizontally. This would allow the
deployment to adapt dynamically to the current load, which helps to save costs
and resources (see the sketch below).
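
  For illustration, scaling on such a metric could be wired up with a standard
  HorizontalPodAutoscaler; the custom metric name below is an assumption and would
  have to be exposed via a metrics adapter:

  ```yaml
  # Illustrative HPA sketch -- the custom metric name is an assumption.
  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: gerrit-replica
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: StatefulSet
      name: gerrit-replica
    minReplicas: 2
    maxReplicas: 6
    metrics:
      - type: Pods
        pods:
          metric:
            name: gerrit_threadpool_free_workers  # hypothetical metric
          target:
            type: AverageValue
            averageValue: "10"
  ```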
### Version 2.0
#### Potential features
- **Multi region support**: Support setups that are distributed over multiple regions. \
\
Supporting Gerrit installations that are distributed over multiple regions would
make it possible to serve clients all over the world without large differences in
latency and would also improve availability and reduce the risk of data loss. \
Such a setup could be achieved by using the [multi-site setup](https://gerrit.googlesource.com/plugins/multi-site/).
- **Remove the dependency on shared storage**: Use completely independent sites
instead of sharing a filesystem for some site components. \
\
NFS and other shared filesystems might cause performance issues on larger
Gerrit installations due to latency. A potential solution might be
to use the [multi-site setup](https://gerrit.googlesource.com/plugins/multi-site/)
to separate the sites of all instances and to use events and replication to
share state.
- **Shared index**: Use an external centralized index, e.g. OpenSearch, instead
of x copies of a Lucene index. \
\
Maintaining x copies of an index, where x is the number of Gerrit instances in
a GerritCluster, is unnecessarily expensive, since the same write transactions
potentially have to be performed x times. Using a single centralized index would
resolve this issue.
- **Shared cache**: Use an external centralized cache for all Gerrit instances. \
\
Using a single cache for all Gerrit instances will reduce the number of
computations for each Gerrit instance, since not every instance will have to
keep its own copy up-to-date.
- **Sharding**: Shard a site based on repositories. \
\
Repositories served by a single GerritCluster might be quite diverse, ranging
from a few kilobytes to several gigabytes in size, or from seeing high traffic
to barely being fetched. It is not trivial to configure Gerrit to work
optimally for all repositories. Being able to shard at least the Gerrit replicas
would help to serve all repositories optimally.
## Helm charts
Only limited support is planned for the `gerrit` and `gerrit-replica` helm charts
once the Gerrit Operator reaches version 1.0, since maintaining all features in
both places would not be feasible with the current number of contributors. The
Gerrit Operator will support all features that are provided by the helm charts.
If community members would like to take over maintainership of the helm charts,
this would be very much appreciated, and the helm charts could then continue to
be supported.