Update multi-site DESIGN.md
The overall design of the multi-site plugin has fallen behind
the current status of the project.
Update the current stage of the implementation and the next
steps in the roadmap.
diff --git a/DESIGN.md b/DESIGN.md
index 270b7af..1bbe78f 100644
@@ -95,8 +95,8 @@
5. 2x masters (active RW/active RO) / single location - separate disks
6. 2x masters (active RW/active RO) / active + disaster recovery location
7. 2x masters (active RW/active RO) / two locations
-8. 2x masters (active RW/active RW) sharded / two locations
-9. 3x masters (active RW/active RW) sharded with auto-election / two locations
+8. 2x masters (active RW/active RW) / two locations
+9. 2 or more masters (active RW/active RW) sharded across 2 or more locations
10. Multiple masters (active RW/active RW) with quorum / multiple locations
The transition between steps requires not only an evolution of the Gerrit
@@ -109,43 +109,104 @@
Google is currently running at Stage #10. Qualcomm is at Stage #4 with the
difference that both masters are serving RW traffic, which is possible because the specifics
of their underlying storage, NFS and JGit implementation allows concurrent
-locking at the filesystem level.
+locking at the filesystem level. GerritHub is running at Stage #9, with 3 locations.
-## TODO: Synchronous replication
-Consider also synchronous replication for cases like 5, 6, 7... in which
-cases a write operation is only accepted if it is synchronously replicated to the
-other master node(s). This would provide 100% loss-less disaster recovery support. Without
-synchronous replication, when the RW master crashes, losing data, there could
-be no way to recover missed replications without soliciting users who pushed the commits
-in the first place to push them again. Further, with synchronous replication
-the RW site has to "degrade" to RO mode when the other node is not reachable and
-synchronous replications are not possible.
+## Projects sharding
-We must re-evaluate the useability of the replication plugin for supporting
-synchronous replication. For example, the replicationDelay doesn't make much
-sense in the synchronous case. Further, the rescheduling of a replication due
-to an in-flight push to the same remote URI also doesn't make much sense as we
-want the replication to happen immediately. Further, if the ref-update of the
-incoming push request has to be blocked until the synchronous replication
-finishes, the replication plugin cannot even start a replication as there is no
-ref-updated event yet. We may consider implementing the synchronous
-replication on a lower level. For example have an "pack-received" event and
-then simply forward that pack file to the other site. Similarly for the
-ref-updated events, instead of a real git push, we could just forward the
-ref-updates to the other site.
+Replicating all the repositories to all sites is, in some cases, not
+a great idea. The rationale can be explained with a simple example.
+### The Tango secret project
+Company FooCompany is developing a huge new secret project code-named Tango
+with a software engineering team geo-located entirely in India.
+The Git repository is huge and contains millions of refs and packfiles totalling tens
+of GBytes. Project Tango also requires some medium-sized binaries to be stored in the
+Git repository.
+FooCompany has a multi-site deployment across the globe, covering Europe, USA,
+Australia and China, in addition to India, where the new project is developed.
+The teams in Europe and USA are involved in the project, from a code-review perspective.
+Their engineers are typically using the Gerrit UI for reviews and fetch individual
+patch-sets for local verification.
+### Tango secret project, without sharding
+All projects are replicated everywhere, including the Tango project.
+The replication creates a huge network overload across the globe.
+When an engineer is pushing a packfile in India, it gets replicated to all sites,
+causing congestion on the replication channel.
+When a software engineer in Europe reviews the changes of the Tango project, the
+review creates modifications to the NoteDb meta refs that are then replicated back
+to India with a non-negligible latency, due to the size of the repository and
+the huge refs advertisement phase implied in the replication.
+Software engineers around the globe do not need to see the Tango project, with
+the exception of the reviewers in Europe and USA. However, everyone is impacted
+and the servers and replication channels are overloaded.
+### Tango secret local project, with sharding
+The multi-site setup uses a sharding logic; projects are replicated
+or not depending on how they are classified:
+1. Global projects: category of projects that need to be always replicated to
+ all sites. (Example: All-Projects and All-Users)
+2. Local projects: category of projects that may not be replicated to
+ all sites. (Example: the Tango project mentioned above)
+The Tango project is a _local project_ because it is mainly developed in one
+geographical location.
+When an engineer is pushing a packfile in India, it does not get replicated to
+all sites, saving bandwidth for the global projects replication.
+When a software engineer in Europe opens a change associated with the Tango project,
+he gets silently redirected to the site in India where the project is located.
+All sessions are broadcast across the sites, so he does not realise that he is
+being redirected: the only thing that he may notice is a slight delay in the
+underlying REST requests made by his browser.
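The silent redirection described above amounts to a simple routing decision. The sketch below is an illustrative model only; the `GLOBAL_PROJECTS` set, `SITE_OF_PROJECT` map and `route` function are hypothetical names, not part of the plugin:

```python
# Illustrative sketch of routing for local vs global projects.
# All names here are hypothetical, not the multi-site plugin's API.
GLOBAL_PROJECTS = {"All-Projects", "All-Users"}  # replicated everywhere
SITE_OF_PROJECT = {"Tango": "india"}             # local projects pinned to one site

def route(project, requesting_site):
    """Return the site that should serve a request for the given project."""
    if project in GLOBAL_PROJECTS or project not in SITE_OF_PROJECT:
        return requesting_site          # served by the local site
    return SITE_OF_PROJECT[project]     # silently redirected to the owning site
```

A reviewer in Europe opening a Tango change would thus be routed to the India site, while requests for global projects stay on the local site.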
+Reviewers commenting on changes of the Tango project create modifications to the
+NoteDb in India, which are immediately visible to the local software engineers,
+without a long replication lag.
+Software engineers around the globe do not need to see the Tango project, with
+the exception of the reviewers in Europe and USA. The Tango project is neither
+visible nor replicated to the other sites, and the people not involved in the
+project are not impacted at all.
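The global/local classification above could be driven by the replication configuration. A hypothetical `replication.config` fragment for a site outside India, assuming the replication plugin's per-remote `projects` filter (the remote name and URL are made up):

```
[remote "frankfurt"]
  url = https://gerrit-eu.foocompany.example/git/${name}.git
  # Only global projects are replicated everywhere; the Tango local
  # project is simply not listed for the remote sites.
  projects = All-Projects
  projects = All-Users
```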
+## Pull replication, synchronous or asynchronous
+Consider also pull replication for cases like 5, 6 and 7, which could be performed
+synchronously with the incoming write operation.
+In case a write operation fails to be replicated to the other master node(s), it could be
+automatically rolled back and reported to the client for retry.
+This would provide 100% loss-less disaster recovery support.
+When running pull replication asynchronously, similarly to the replication plugin,
+an unrecoverable crash of the replication source would result in unnoticed data loss.
+The only way to recover the data would be to ask the users who pushed the commits
+to push them again. However, someone needs to manually detect the issue in the
+replication log and get in touch with the users.
+The [pull-replication plugin](https://gerrit.googlesource.com/plugins/pull-replication)
+supports synchronous replication and also has the structure to support the
+asynchronous variant in the future.
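The synchronous write-then-rollback behaviour described above can be sketched as follows. This is an illustrative model, not the pull-replication plugin's actual API: `local_db` is a plain dict standing in for the local ref database, and each replica is a callable returning whether its synchronous pull succeeded:

```python
# Illustrative sketch of synchronous replication with rollback on failure.
def apply_with_sync_replication(local_db, replicas, ref, new_value):
    old_value = local_db.get(ref)
    local_db[ref] = new_value
    # Ask every replica to pull the update synchronously.
    failed = [pull for pull in replicas if not pull(ref, new_value)]
    if failed:
        # Roll back the local write and report the failure to the client.
        if old_value is None:
            del local_db[ref]
        else:
            local_db[ref] = old_value
        return False
    return True
```

With this model a crashed source never leaves an acknowledged-but-unreplicated write behind: the client is told to retry instead.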
## History and maturity level of the multi-site plugin
This plugin expands upon the excellent work on the high-availability plugin,
introduced by Ericsson for implementing multi-master at Stage #4. The git log history
of this project still shows the 'branching point' where it started.
+The v2.16.x release of the multi-site plugin (with NoteDb) was at Stage #7.
-The current version of the multi-site plugin is at Stage #7, which is a pretty
-advanced stage in the Gerrit multi-master/multi-site configuration.
-Thanks to the multi-site plugin, it is now possible for Gerrit data to be
-available in two separate geo-locations (e.g. San Francisco and Bangalore),
-each serving local traffic through the local instances with minimum latency.
+The current version of the multi-site plugin is at Stage #9: it is now possible for
+Gerrit data to be available in two or more separate geo-locations
+(e.g. San Francisco, Frankfurt and Bangalore), each serving local traffic through
+the local instances with minimum latency.
### Why another plugin from a high availability fork?
@@ -161,60 +222,73 @@
multi-site, allows us to have a simpler, more usable experience, both for developers
of the plugin and for the Gerrit administrators using it.
+The high-availability and multi-site plugins are solutions to different problems.
+Two or more nodes on the same site are typically deployed to increase
+the reliability and scalability of a Gerrit setup; however, this doesn't provide any
+benefit in terms of data access across locations. Replicating the repositories
+to remote locations does not help the scalability of a Gerrit setup but is more
+focused on reducing the data transfer time between the client and the server, thanks
+to the higher bandwidth available in the local regions.
-There are some advantages in implementing multi-site at Stage #7:
+There are some advantages in implementing multi-site at Stage #9:
-- Optimal latency of the read-only operations on both sites, which constitutes around 90%
- of the Gerrit traffic overall.
+- Optimal latency of the Git read/write operations on all sites, and significant
+  improvement of the Gerrit UI responsiveness, thanks to the reduction of the
+  network latency.
- High SLA (99.99% or higher, source: GerritHub.io) can be achieved by
- implementing both high availability inside each local site, and automatic
- catastrophic failover between the two sites.
+ implementing network distribution across sites.
-- Access transparency through a single Gerrit URL entry-point.
+- Access transparency through a single Gerrit URL, thanks to a geo-location DNS.
-- Automatic failover, disaster recovery, and leader re-election.
+- Automatic disaster recovery and failover to remote sites.
-- The two sites have local consistency, with eventual consistency globally.
+- All sites have local consistency, with the assurance of global eventual
+  consistency.
-The current limitations of Stage #7 are:
+The current limitations of Stage #9 are:
-- **Single RW site**: Only the RW site can accept modifications on the
- Git repositories or the review data.
+- **Limited support for many sites**:
+  One could, potentially, support a very high number of sites, but the pull-replication
+  logic to all sites could have serious consequences on the overall perceived latency.
+  Dealing with a very high number of sites requires the implementation of a quorum across
+  all the nodes available for replication.
-- **Supports only two sites**:
- One could, potentially, support more sites, but the configuration
- and maintenance efforts are more than linear to the number of nodes.
-- **Single point of failure:** The switch between the RO to RW sites is managed by a unique decision point.
-- **Lack of transactionality**:
- Data written to one site is acknowledged before its replication to the other location.
-- **Requires Gerrit v2.16 or later**: Data conisistency requires a server completely based on NoteDb.
+- **Requires Gerrit v3.0 or later**: Data consistency requires a server completely
+  based on NoteDb.
If you are not familiar with NoteDb, please read the relevant
- [section in the Gerrit documentation](https://gerrit-documentation.storage.googleapis.com/Documentation/2.16.5/note-db.html).
+ [section in the Gerrit documentation](https://gerrit-documentation.storage.googleapis.com/Documentation/3.0.12/note-db.html).
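The quorum mentioned among the limitations above could follow a simple majority rule. A minimal sketch, with a hypothetical helper that is not part of the plugin:

```python
# Illustrative majority-quorum check: a write is accepted only when
# more than half of the sites available for replication acknowledge it.
def has_quorum(acks: int, total_sites: int) -> bool:
    return acks >= total_sites // 2 + 1
```

For example, with five sites a write would need acknowledgements from at least three of them before being accepted.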
### Example of multi-site operations
-Let's suppose the RW site is San Francisco and the RO site Bangalore. The
-modifications of data will always come to San Francisco and flow to Bangalore
-with a latency that can be between seconds and minutes, depending on
-the network infrastructure between the two sites. A developer located in
-Bangalore will always see a "snapshot in the past" of the data, both from the
-Gerrit UI and on the Git repository served locally. In contrast, a developer located in
-San Francisco will always see the "latest and greatest" of everything.
+Let's suppose you have two sites, in San Francisco and Bangalore. The
+modifications of data will flow from San Francisco to Bangalore and the other way round.
+Depending on the network infrastructure between the two sites, latency can range
+between seconds and minutes. The available bandwidth is low, so the Gerrit admin
+decides to use traditional (asynchronous) push replication between the two sites.
+When a developer located in Bangalore accesses a repository for which most pushes
+originate from San Francisco, he may see a "snapshot in the past" of the data,
+both from the Gerrit UI and on the Git repository served locally.
+In contrast, a developer located in San Francisco will always see on his repository
+the "latest and greatest" of everything.
+Things work exactly the other way around for a repository that mainly
+receives pushes from developers in Bangalore.
Should the central site in San Francisco become unavailable for a
-significant period of time, the Bangalore site will take over as the RW Gerrit
-site. The roles will then be inverted.
-People in San Francisco will be served remotely by the
+significant period of time, the Bangalore site will still be able to serve all
+Gerrit repositories, including those where most pushes come from San Francisco.
+People in San Francisco can't access their local site anymore, because it is
+unavailable. All the Git and Gerrit UI requests will be served remotely by the
Bangalore server while the local system is down. When the San Francisco site
-returns to service, and passes the "necessary checks", it will be re-elected as the
-main RW site.
+comes up again, and passes the "necessary checks", it
+will become the main site again for the users in the same geo-location.
# Plugin design
@@ -249,7 +323,7 @@
Sessions are stored by default on the local filesystem in an H2 table but can
be externalized via plugins, like the WebSession Flatfile.
-To achieve a Stage #7 multi-site configuration, all the above information must
+To achieve a Stage #9 multi-site configuration, all the above information must
be replicated transparently across sites.
## High-level architecture
@@ -270,11 +344,14 @@
When no specific implementation is provided, the [Global Ref-DB Noop implementation](#global-ref-db-noop-implementation)
is used and the libModule interfaces are mapped to internal no-op implementations.
-- **replication plugin**: enables the replication of the _Git repositories_ across
+- **replication plugin**: enables asynchronous push replication of the _Git repositories_
+ across sites.
-- **web-session flat file plugin**: supports the storage of _active sessions_
- to an external file that can be shared and synchronized across sites.
+- **pull replication plugin**: enables the synchronous replication of the _Git repositories_
+ across sites.
+- **web-session broker plugin**: supports the storage of _active sessions_
+ to a message broker topic, which is then broadcasted across sites.
- **health check plugin**: supports the automatic election of the RW site based
on a number of underlying conditions of the data and the systems.
@@ -288,6 +365,7 @@
## Implementation Details
### Multi-site libModule
As mentioned earlier there are different components behind the overarching architecture
of this solution for a distributed multi-site Gerrit installation, each one fulfilling
a specific goal. However, whilst the goal of each component is well-defined, the
@@ -355,6 +433,7 @@
etcd, MySQL, Mongo, etc.
#### Global Ref-DB Noop implementation
The default `Noop` implementation provided by the `Multi-site` libModule accepts
any refs without checking for consistency. This is useful for setting up a test environment
and allows multi-site library to be installed independently from any additional
@@ -566,33 +645,18 @@
# Next steps in the roadmap
-## Step-1: Fill the gaps in multi-site Stage #7 implementation:
-- **Detection of a stale site**: The health check plugin has no awareness that one
- site that can be "too outdated" because it is still technically "healthy." A
- stale site needs to be put outside the balancing and all traffic needs to go
- to the more up-to-date site.
-- **Web session replication**: This currently must be implemented at the filesystem level
- using rsync across sites. This is problematic because of the delay it
- introduces. Should a site fail, some of the users may lose their sessions
- because the rsync was not executed yet.
-- **Index rebuild in case of broker failure**: In the case of a catastrophic
- failure at the broker level, the indexes of the two sites will be out of
- sync. A mechanism is needed to recover the situation
- without requiring the reindex of both sites offline, since that could take
- as much as days for huge installations.
-- **Git/SSH redirection**: Local users who rely on Git/SSH protocol are not able
- to use the local site for serving their requests, because HAProxy is not
- able to differentiate the type of traffic and, thus, is forced always to use the
- RW site, even though the operation is RO.
-## Step-2: Move to multi-site Stage #8.
+## Move to multi-site Stage #10.
- Auto-reconfigure HAProxy rules based on the projects sharding policy
-- Serve RW/RW traffic based on the project name/ref-name.
+- Implement more global-refdb storage layers (e.g. TiKV) and more cloud-native
+ message brokers (e.g. NATS)
-- Balance traffic with "locally-aware" policies based on historical data
+- Implement a quorum-based policy for accepting or rejecting changes in the pull-replication
+- Allow asynchronous pull-replication across sites, based on asynchronous events through
+ the message broker
+- Implement a "fast replication path" for NoteDb-only changes, instead of relying on the
+ Git protocol
\ No newline at end of file