This document collects and organizes thoughts about the design of the Gerrit multi-site plugin, supporting the definition of the implementation roadmap.
It first presents background for the problems the plugin will address and the tools currently available in the Gerrit ecosystem that support the solution. It then lays out an overall roadmap for implementing support for Gerrit multi-site, and a snapshot of the current status of the design including associated limitations and constraints.
Companies that adopt Gerrit as the center of their development and review pipeline often have the requirement to be available on a 24/7 basis. This requirement may extend across large and geographically distributed teams in different continents.
Because of the constraints defined by the CAP theorem, designing a performant and scalable solution is a real challenge.
Vertical scaling is one of the options to support a high load and a large number of users. A powerful server with multiple cores and sufficient RAM to potentially fit the most frequently used repositories simplifies the design and implementation of the system. The relatively reasonable cost of hardware and availability of multi-core CPUs make this solution highly attractive to some large Gerrit setups. Further, the central Gerrit server can be duplicated with an active/passive or active/active high availability configuration with the storage of the Git repositories shared across nodes through dedicated fibre-channel lines or SANs.
This approach can be suitable for mid to large-sized Gerrit installations where teams are co-located or connected via high-speed dedicated networks. However, when teams are located on opposite sides of the planet, even a direct fibre-channel connection running at its theoretical best, the speed of light, can be limiting. For example, from San Francisco to Bangalore the theoretical absolute minimum latency is 50 msec. In practice, however, it is often around 150 to 200 msec in the best-case scenarios.
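The theoretical figure can be checked with a short back-of-the-envelope calculation. The sketch below is only an illustration: the city coordinates are approximate, the great-circle (haversine) distance is the shortest possible path rather than a real cable route, and the two-thirds factor for light in optical fibre is a common rule of thumb.

```python
import math

EARTH_RADIUS_KM = 6371.0            # mean Earth radius
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def one_way_latency_ms(distance_km, speed_km_s=SPEED_OF_LIGHT_KM_S):
    """Time for a signal to cover distance_km at the given speed, in msec."""
    return distance_km / speed_km_s * 1000.0


# Approximate city-centre coordinates (assumed values for illustration).
SAN_FRANCISCO = (37.7749, -122.4194)
BANGALORE = (12.9716, 77.5946)

distance_km = great_circle_km(*SAN_FRANCISCO, *BANGALORE)
vacuum_ms = one_way_latency_ms(distance_km)
# Light in optical fibre travels at roughly two thirds of its vacuum speed.
fibre_ms = one_way_latency_ms(distance_km, SPEED_OF_LIGHT_KM_S * 2 / 3)

print(f"distance: {distance_km:.0f} km")
print(f"one-way latency in vacuum: {vacuum_ms:.1f} msec")
print(f"one-way latency in fibre:  {fibre_ms:.1f} msec")
```

The vacuum figure lands just under the 50 msec quoted above. The fibre figure and the round trip needed by any request/response protocol move the number towards the latencies observed in practice.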
In the alternative option, horizontal scaling, the workload is spread across several nodes, which are distributed to different locations. For our teams in San Francisco and Bangalore, each team accesses a set of Gerrit masters located closer to its geographical location, with higher bandwidth and lower latency. (To control operational cost from the proliferation of servers, the number of Gerrit masters can be scaled up and down on demand.)
This solution offers a higher level of scalability and lower latency across locations, but it requires a more complex design.
The vertical and horizontal approaches can be combined to achieve both high performance on the same location and low latency across geographically distributed sites.
Geographical locations with larger teams and projects can have a bigger Gerrit server in a high availability configuration, while locations with less critical service levels can use a lower-spec setup.
The multi-site plugin enables the OpenSource version of Gerrit Code Review to support horizontal scalability across sites.
Gerrit has already been deployed in a multi-site configuration at Google and in a multi-master fashion at Qualcomm. Both implementations include fixes and extensions that are tailored to the specific infrastructure requirements of each company's global networks. Those solutions may or may not be shared with the rest of the OpenSource Community. Specifically, Google's deployment is proprietary and not suitable for any environment outside Google's data-centers. Further, in Qualcomm's case, their version of Gerrit is a fork of v2.7.
In contrast, the multi-site plugin is based on standard OpenSource components and is deployed on a standard cloud environment. It is currently used in a multi-master and multi-site deployment on GerritHub.io, serving two continents (Europe and North America) in a high availability setup on each site.
The development of the multi-site support for Gerrit is complex and thus has been deliberately broken down into incremental steps. The starting point is a single Gerrit master deployment, and the end goal is a fully distributed set of cooperating Gerrit masters across the globe.
The transition between steps requires not only an evolution of the Gerrit setup and the set of plugins but also the implementation of more mature methods to provision, maintain and version servers across the network. Qualcomm has pointed out that the evolution of the company culture and the ability to consistently version and provision the different server environments are winning features of their multi-master setup.
Google is currently running at Stage #10. Qualcomm is at Stage #4, with the difference that both masters are serving RW traffic. This is possible because the specifics of their underlying storage, NFS and JGit implementation allow concurrent locking at the filesystem level.
Consider also synchronous replication for cases like Stages #5, #6, #7..., in which a write operation is only accepted if it is synchronously replicated to the other master node(s). This would provide 100% loss-less disaster recovery support. Without synchronous replication, if the RW master crashes and loses data, there may be no way to recover missed replications other than asking the users who pushed the commits in the first place to push them again. On the other hand, with synchronous replication the RW site has to “degrade” to RO mode when the other node is not reachable and synchronous replications are not possible.
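The accept-only-if-replicated rule and the degradation to RO mode can be sketched as follows. This is a toy in-memory model, not the replication plugin: the `Peer` and `SyncReplicatedMaster` classes and all of their methods are hypothetical names, and it assumes a single peer site.

```python
class PeerUnreachable(Exception):
    """Raised when the other site cannot be contacted."""


class Peer:
    """Hypothetical stand-in for the other master site."""

    def __init__(self):
        self.reachable = True
        self.refs = {}

    def apply(self, ref, sha):
        if not self.reachable:
            raise PeerUnreachable()
        self.refs[ref] = sha


class SyncReplicatedMaster:
    """RW master that acknowledges a write only after the peer has it."""

    def __init__(self, peer):
        self.peer = peer
        self.refs = {}
        self.read_only = False

    def push(self, ref, sha):
        if self.read_only:
            return False          # degraded: refuse all writes
        try:
            # Replicate first; the local ref is updated only on success.
            self.peer.apply(ref, sha)
        except PeerUnreachable:
            self.read_only = True  # degrade to RO until the peer is back
            return False
        self.refs[ref] = sha
        return True


peer = Peer()
master = SyncReplicatedMaster(peer)
print(master.push("refs/heads/master", "W1"))  # accepted and replicated
peer.reachable = False
print(master.push("refs/heads/master", "W2"))  # refused, master degrades
```

Because the local ref is written only after the peer has accepted the update, a crash of the master cannot lose an acknowledged write. The price is that an unreachable peer blocks all writes.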
We must re-evaluate the usability of the replication plugin for supporting synchronous replication. For example, the replicationDelay doesn't make much sense in the synchronous case. Further, the rescheduling of a replication due to an in-flight push to the same remote URI also doesn't make much sense, as we want the replication to happen immediately. Further, if the ref-update of the incoming push request has to be blocked until the synchronous replication finishes, the replication plugin cannot even start a replication as there is no ref-updated event yet. We may consider implementing the synchronous replication on a lower level. For example, we could have a “pack-received” event and then simply forward that pack file to the other site. Similarly for the ref-updated events, instead of a real git push, we could just forward the ref-updates to the other site.
This plugin expands upon the excellent work on the high-availability plugin, introduced by Ericsson for implementing multi-master at Stage #4. The git log history of this project still shows the ‘branching point’ where it started.
The current version of the multi-site plugin is at Stage #7, which is a pretty advanced stage in the Gerrit multi-master/multi-site configuration.
Thanks to the multi-site plugin, it is now possible for Gerrit data to be available in two separate geo-locations (e.g. San Francisco and Bangalore), each serving local traffic through the local instances with minimum latency.
You may be questioning the reasoning behind creating yet another plugin for multi-master, instead of maintaining a single code-base with the high-availability plugin. The choice stems from the differing design considerations to address scalability, as discussed above for the vertical (single site) and horizontal (multi-site) approaches.
In theory, one could keep a single code-base to manage both approaches. However, the result would be very complicated and difficult to configure and install. Having two more focussed plugins, one for high availability and another for multi-site, allows us to have a simpler, more usable experience, both for developers of the plugin and for the Gerrit administrators using it.
There are some advantages in implementing multi-site at Stage #7:
Optimal latency of the read-only operations on both sites, which constitute around 90% of the Gerrit traffic overall.
High SLA (99.99% or higher, source: GerritHub.io) can be achieved by implementing both high availability inside each local site, and automatic catastrophic failover between the two sites.
Access transparency through a single Gerrit URL entry-point.
Automatic failover, disaster recovery, and leader re-election.
The two sites have local consistency, with eventual consistency globally.
The current limitations of Stage #7 are:
Single RW site: Only the RW site can accept modifications on the Git repositories or the review data.
Supports only two sites: One could, potentially, support more sites, but the configuration and maintenance efforts grow more than linearly with the number of nodes.
Single point of failure: The switch between the RO and RW sites is managed by a unique decision point.
Lack of transactionality: Data written to one site is acknowledged before its replication to the other location.
Requires Gerrit v2.16 or later: Data consistency requires a server completely based on NoteDb. If you are not familiar with NoteDb, please read the relevant section in the Gerrit documentation.
Let's suppose the RW site is San Francisco and the RO site Bangalore. The modifications of data will always come to San Francisco and flow to Bangalore with a latency that can be between seconds and minutes, depending on the network infrastructure between the two sites. A developer located in Bangalore will always see a “snapshot in the past” of the data, both from the Gerrit UI and on the Git repository served locally. In contrast, a developer located in San Francisco will always see the “latest and greatest” of everything.
Should the central site in San Francisco become unavailable for a significant period of time, the Bangalore site will take over as the RW Gerrit site. The roles will then be inverted. People in San Francisco will be served remotely by the Bangalore server while the local system is down. When the San Francisco site returns to service, and passes the “necessary checks”, it will be re-elected as the main RW site.
This section goes into the high-level design of the current solution, lists the components involved, and describes how the components interact with each other.
There are several distinct classes of information that have to be kept consistent across different sites in order to guarantee seamless operation of the distributed system.
Git repositories: They are stored on disk and are the most important information to maintain. The repositories store the following data:
Git BLOBs, objects, refs and trees.
NoteDb, including Groups, Accounts and review data
Project configurations and ACLs
Project submit rules
Indexes: A series of secondary indexes to allow search and quick access to the Git repository data. Indexes are persistent across restarts.
Caches: A set of in-memory and persisted data designed to reduce CPU and disk utilization and to improve performance.
Web Sessions: Define an active user session with Gerrit, used to reduce load to the underlying authentication system. Sessions are stored by default on the local filesystem in an H2 table but can be externalized via plugins, like the WebSession Flatfile.
To achieve a Stage #7 multi-site configuration, all the above information must be replicated transparently across sites.
The multi-site solution described here depends upon the combined use of different components:
multi-site plugin: Enables the replication of Gerrit indexes, caches, and stream events across sites.
replication plugin: Enables the replication of the Git repositories across sites.
web-session flat file plugin: Supports the storage of active sessions to an external file that can be shared and synchronized across sites.
health check plugin: Supports the automatic election of the RW site based on a number of underlying conditions of the data and the systems.
HAProxy: Provides the single entry-point to all Gerrit functionality across sites.
The interactions between these components are illustrated in the following diagram:
The multi-site plugin adopts an event-sourcing pattern and is based on an external message broker. The current implementation uses Apache Kafka. It is, however, potentially extensible to others, like RabbitMQ or NATS.
The replication of the Git repositories, indexes, caches and stream events happens on different channels and at different speeds. Git data is typically larger than meta-data and has higher latency than reindexing, cache evictions or stream events. This means that when someone pushes a new change to Gerrit on one site, the Git data (commits, BLOBs, trees, and refs) may arrive later than the associated reindexing or cache eviction events.
It is, therefore, necessary to handle the lack of synchronization of those channels in the multi-site plugin and reconcile the events at the destinations.
The solution adopted by the multi-site plugin supports eventual consistency at rest at the data level, thanks to the following two components:
Identify not-yet-processable events: A mechanism to recognize not-yet-processable events related to data not yet available (based on the timestamp information available on both the metadata update and the data event)
Queue not-yet-processable events: A queue of not-yet-processable events and an asynchronous processor to check if they have become processable. The system is also configured to discard events that have been in the queue for too long.
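The interplay of the two components can be sketched as below. This is a simplified illustration and not the plugin's code: `PendingEvent`, `DeferredEventProcessor`, the `local_ts` lookup and the `max_age_s` setting are all hypothetical names, and the periodic call to `retry_pending` stands in for the asynchronous processor.

```python
import time
from collections import deque


class PendingEvent:
    """An index or cache event referring to data that may not have arrived yet."""

    def __init__(self, key, required_ts):
        self.key = key                  # e.g. a change identifier (hypothetical)
        self.required_ts = required_ts  # timestamp of the update the event refers to
        self.queued_at = None


class DeferredEventProcessor:
    def __init__(self, local_ts, apply_event, max_age_s=300.0, clock=time.monotonic):
        self.local_ts = local_ts        # callable: key -> timestamp of local data, or None
        self.apply_event = apply_event  # callable invoked once an event is processable
        self.max_age_s = max_age_s      # events queued for longer than this are discarded
        self.clock = clock
        self.pending = deque()

    def is_processable(self, event):
        """The event can be handled once the local data is at least as recent."""
        ts = self.local_ts(event.key)
        return ts is not None and ts >= event.required_ts

    def submit(self, event):
        """Process the event now if possible, otherwise queue it for later."""
        if self.is_processable(event):
            self.apply_event(event)
            return True
        event.queued_at = self.clock()
        self.pending.append(event)
        return False

    def retry_pending(self):
        """One pass over the queue; returns (processed, discarded) counts."""
        now = self.clock()
        processed = discarded = 0
        for _ in range(len(self.pending)):
            event = self.pending.popleft()
            if self.is_processable(event):
                self.apply_event(event)
                processed += 1
            elif now - event.queued_at > self.max_age_s:
                discarded += 1
            else:
                self.pending.append(event)  # still waiting for the Git data
        return processed, discarded
```

In a real deployment `retry_pending` would be driven by a scheduler, and discarded events would probably be logged so that an administrator can trigger a manual reindex.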
Stream events are wrapped into an event header containing a source identifier. Events originated by the same node in the broker-based channel are silently dropped so that they do not replicate multiple times.
Gerrit has the concept of server-id which, unfortunately, does not help solve this problem because all the nodes in a Gerrit cluster must have the same server-id to allow interoperability of the data stored in NoteDb.
The multi-site plugin introduces a new concept of instance-id, which is a UUID generated during startup and saved into the data folder of the Gerrit site. If the Gerrit site is cleared or removed, a new id is generated and the multi-site plugin will start consuming all events that have been previously produced.
The concept of the instance-id is very useful. Since other plugins could benefit from it, it will be the first candidate to move into the Gerrit core, generated and maintained with the rest of the configuration. Then it can be included in all stream events, at which time the multi-site plugin's “enveloping of events” will become redundant.
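A minimal sketch of the instance-id and of the event envelope is shown below. The file name `instance-id`, the JSON layout and the `sourceInstanceId` field are assumptions made for illustration; they are not taken from the plugin.

```python
import json
import tempfile
import uuid
from pathlib import Path


def load_or_create_instance_id(data_dir):
    """Return the id stored under data_dir, generating it on first startup."""
    path = Path(data_dir) / "instance-id"   # hypothetical file name
    if path.exists():
        return path.read_text().strip()
    instance_id = str(uuid.uuid4())
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(instance_id)
    return instance_id


def wrap(event, instance_id):
    """Wrap a stream event into an envelope carrying the source identifier."""
    return json.dumps({"header": {"sourceInstanceId": instance_id}, "body": event})


def unwrap_if_foreign(message, local_instance_id):
    """Return the event body, or None if this node produced the event itself."""
    envelope = json.loads(message)
    if envelope["header"]["sourceInstanceId"] == local_instance_id:
        return None   # silently drop our own events
    return envelope["body"]


with tempfile.TemporaryDirectory() as site_a, tempfile.TemporaryDirectory() as site_b:
    id_a = load_or_create_instance_id(site_a)
    id_b = load_or_create_instance_id(site_b)
    message = wrap({"type": "ref-updated"}, id_a)
    print(unwrap_if_foreign(message, id_a))  # None: produced locally
    print(unwrap_if_foreign(message, id_b))  # the event body: produced elsewhere
```

Since the id is read back from disk when present, it survives restarts. Deleting the data folder yields a new id, which matches the behaviour described above for a cleared site.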
The broker-based solutions improve the resilience and scalability of the system, but there is still a point of failure: the availability of the broker itself. However, using the broker does allow a high level of redundancy and a multi-master / multi-site configuration at the transport and storage level.
At the moment, the acknowledgement level for publication can be controlled via configuration, which allows tuning the QoS of the publication process. Failures are explicitly not handled at the moment; they are just logged as errors. There is no retry mechanism to handle temporary failures.
The current solution of multi-site at Stage #7 with asynchronous replication risks that the system will reach a Split Brain situation (see issue #10554).
In this case we are considering two different clients each doing a push on top of the same reference. This could be a new commit in a branch or the change of an existing commit.
t0: both clients see the status of the reference as W0. Instance1 is the RW node and will receive any push request. Instance1 and Instance2 are in sync at W0.
t1: Client1 pushes W1. The request is served by Instance1, which acknowledges it and starts the replication process (with some delay).
t2: The replication operation is completed. Both instances are in a consistent state W0 -> W1. Client1 shares that state but Client2 is still behind.
t3: Instance1 crashes.
t4: Client2 pushes W2, which is still based on W0 (W0 -> W2). The request is served by Instance2, which detects that the client push operation was based on an out-of-date starting state for the ref. The operation is refused. Client2 synchronises its local state (e.g. rebases its commit) and pushes W0 -> W1 -> W2. That operation is now considered valid, acknowledged and put in the replication queue until Instance1 becomes available.
t5: Instance1 restarts and is replicated at W0 -> W1 -> W2.
In this case the steps are very similar except that Instance1 fails after acknowledging the push of W0 -> W1 but before having replicated the status to Instance2.
When Client2 then pushes W0 -> W2 to Instance2, this is considered a valid operation. It gets acknowledged and inserted in the replication queue.
Instance1 restarts. At this point both instances have pending replication operations. They are executed in parallel and they bring the system to divergence.
Root causes of the Split Brain problem:
The RW node acknowledges a push operation before all replicas are fully in sync.
The other instances are not aware that they are out of sync.
Two possible approaches to solve the Split Brain problem:
Synchronous replication: In this case the system would behave essentially as the happy path diagram shown above and would solve the problem by operating on the first of the causes, at the expense of performance, availability and scalability. It is a viable and simple solution for two nodes set up with an infrastructure allowing fast replication.
Centralise the information about the latest status of mutable refs: This would operate on the second cause. That is, it would allow instances to realise that they are not in sync on a particular ref and refuse any write operation on that ref. The system could operate normally on any other ref and would also impose no limitation on other functions such as serving the GUI, supporting reads, and accepting new changes or patch-sets on existing changes. This option is discussed in further detail below.
NOTE: The two options are not exclusive.
An implementation of the out-of-sync detection logic could be based on a central coordinator holding the last known status of a mutable ref (immutable refs won't have to be stored here). This would be, essentially, a DFS base
This component would:
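Store only mutable refs, keeping the most recent sha for each specific ref.
Support atomic compare-and-set operations.
Be based on an external implementation such as Zookeeper. (One implementation was done by Dave Borowitz some years ago.)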
This interaction is illustrated in the diagram below:
The difference, in respect to the split brain use case, is that now, whenever a change of a mutable ref is requested, the Gerrit server verifies with the central RefDB that its status for this ref is consistent with the latest cluster status. If that is true the operation succeeds. The ref status is atomically compared and set to the new status to prevent race conditions.
In this case Instance2 enters a Read Only mode for the specific branch until the replication from Instance1 is completed successfully. At this point write operations on the reference can be recovered. If Client2 performs the push again against Instance2, the server recognises that the client status needs an update, and the client will rebase and push the correct status.
NOTE: This implementation will prevent the cluster from entering split brain, but might result in a set of refs in Read Only state across the whole cluster if the RW node fails after having sent the request to the Ref-DB but before persisting this request into its
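The check against the central RefDB can be sketched as follows. This is a toy in-memory model under stated assumptions: `SharedRefDb` stands in for an external coordinator such as Zookeeper, a lock provides the atomicity of compare-and-set, and the class names, method names and return strings are hypothetical.

```python
import threading


class SharedRefDb:
    """In-memory stand-in for the central coordinator of mutable refs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = {}

    def compare_and_set(self, ref, expected, new):
        """Atomically move ref from expected to new; fail if it has moved on."""
        with self._lock:
            if self._refs.get(ref) != expected:
                return False
            self._refs[ref] = new
            return True


class Instance:
    """A Gerrit master holding its own local view of the refs."""

    def __init__(self, shared):
        self.shared = shared
        self.local = {}

    def push(self, ref, old_sha, new_sha):
        if self.local.get(ref) != old_sha:
            return "rejected: client out of date"
        # Verify against the cluster-wide status before accepting the write.
        if not self.shared.compare_and_set(ref, old_sha, new_sha):
            return "rejected: instance out of sync"  # ref is read-only here for now
        self.local[ref] = new_sha
        return "ok"

    def replicate_from(self, other, ref):
        self.local[ref] = other.local[ref]


REF = "refs/heads/master"
shared = SharedRefDb()
shared.compare_and_set(REF, None, "W0")
instance1, instance2 = Instance(shared), Instance(shared)
instance1.local[REF] = instance2.local[REF] = "W0"

print(instance1.push(REF, "W0", "W1"))   # accepted; Instance1 then fails before replicating
print(instance2.push(REF, "W0", "W2"))   # refused: Instance2 is behind the cluster
instance2.replicate_from(instance1, REF)  # replication eventually completes
print(instance2.push(REF, "W0", "W2"))   # refused: the client must rebase
print(instance2.push(REF, "W1", "W2b"))  # accepted after the rebase
```

The second push is the one that caused divergence in the split brain scenario. Here it is refused because the shared status has already moved to W1, so only that ref is blocked on Instance2 until replication catches up.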
Detection of a stale site: The health check plugin has no awareness that a site can be “too outdated” because it is still technically “healthy.” A stale site needs to be taken out of the load balancing, and all traffic needs to go to the more up-to-date site.
Web session replication: This currently must be implemented at the filesystem level using rsync across sites. This is problematic because of the delay it introduces. Should a site fail, some of the users may lose their sessions because the rsync was not executed yet.
Index rebuild in case of broker failure: In the case of a catastrophic failure at the broker level, the indexes of the two sites will be out of sync. A mechanism is needed to recover the situation without requiring an offline reindex of both sites, since that could take days for huge installations.
Git/SSH redirection: Local users who rely on Git/SSH protocol are not able to use the local site for serving their requests, because HAProxy is not able to differentiate the type of traffic and, thus, is forced always to use the RW site, even though the operation is RO.
Support for different brokers: Currently, the multi-site plugin supports Kafka. More brokers need to be supported in a fashion similar to the ITS-* plugins framework. Explicit references to Kafka must be removed from the multi-site plugin. Other plugins may contribute implementations to the broker extension point.
Split the publishing and subscribing: Create two separate plugins. Combine the generation of the events into the current kafka-events plugin. The multi-site plugin will focus on consumption of, and sorting of, the replication issues.
Auto-reconfigure HAProxy rules based on the projects sharding policy
Serve RW/RW traffic based on the project name/ref-name.
Balance traffic with “locally-aware” policies based on historical data
Prevent split-brain in case of temporary site isolation