| # Gerrit dual-primary in High-Availability |
| |
This set of templates provides all the components to deploy a Gerrit dual-primary
in HA on ECS. The two primaries will share the Git repositories via NFS, using EFS.
| |
| ## Architecture |
| |
| The following templates are provided in this example: |
* `cf-cluster`: defines the ECS cluster and the networking stack
* `cf-service-primary`: defines the service stack running a Gerrit primary
* `cf-dns-route`: defines the DNS routing for the service
* `cf-service-replica`: defines the service stack running the Gerrit replica
* `cf-dashboard`: defines the CloudWatch dashboard for the services
| |
When the recipe enables the replication_service (see [docs](#replication-service)),
these additional templates will be executed:
| |
* `cf-service-replication`: defines a replication stack that allows git replication
over the EFS volume, which is mounted by the primary instances.
| |
### Data persistence
| |
| * EBS volumes for: |
| * Indexes |
| * Caches |
| * Logs |
| * EFS volume: |
| * Share Git repositories between primaries |
| * Share Web sessions between primaries |
| |
*NOTE*: This stack uses EFS in provisioned mode, which is a better setting for large repos
(> 1GB uncompressed) since it provides lower latency compared to burst mode.
However, it has some [associated costs](https://aws.amazon.com/efs/pricing/).
If you are dealing with small repos, you can switch to burst mode, as shown below.
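
For example, a minimal `setup.env` fragment selecting burst mode for both
filesystems (these parameters are described in the [Environment](#environment)
section below):

```bash
# Throughput mode of the EFS volumes: `provisioned` for large repos,
# `bursting` for small ones.
PRIMARY_FILESYSTEM_THROUGHPUT_MODE=bursting
REPLICA_FILESYSTEM_THROUGHPUT_MODE=bursting
```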
| |
#### Deploying using pre-existing data
| |
| Gerrit stores information in two volumes: git data (and possibly websessions, |
| when not using multi-site) are shared across Gerrit nodes and therefore |
| persisted in the EFS volume, whilst cache, logs, plugins data and indexes are |
| local to each specific Gerrit node and thus stored in the EBS volume. |
| |
In order to deploy a Gerrit instance that runs on pre-existing data, the EFS
volume and an EBS snapshot need to be specified in the `setup.env` file (see
[configuration](#environment) for more information on how to do this).
| |
Referring to persistent volumes allows you to perform [blue-green deployments](#bluegreen-deployment).
| |
| ### Deployment type |
| |
| * Latest Gerrit version deployed using the official [Docker image](https://hub.docker.com/r/gerritcodereview/gerrit) |
| * Application deployed in ECS on a single EC2 instance |
| |
| #### Blue/Green deployment |
| |
When a dual-primary stack is created, unless otherwise specified, a new EFS is
created and two new empty EBS volumes are attached to primary1 and primary2,
respectively.
| |
| In a [blue/green deployment](https://en.wikipedia.org/wiki/Blue-green_deployment) |
| scenario, this initial stack is called the *blue* stack. |
| |
| ```bash |
| make AWS_REGION=us-east-1 AWS_PREFIX=gerrit-blue create-all |
| ``` |
| |
Later on (days, weeks, months), the need for a change arises and a new version
of the cluster needs to be deployed: this will be the _green_ stack, which is
deployed as follows:
| |
1. Take an EBS snapshot of the primary1 volume attached to `/dev/xvdg` (note that
this needs to be done during a read-only window). Ideally this step is already
performed regularly by a backup script.
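
For example, a hedged sketch using the AWS CLI (the instance id is a
placeholder; substitute the primary1 instance of your blue stack):

```bash
# Look up the EBS volume attached as /dev/xvdg on the primary1 instance.
VOLUME_ID=$(aws ec2 describe-volumes --region us-east-1 \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
            Name=attachment.device,Values=/dev/xvdg \
  --query 'Volumes[0].VolumeId' --output text)

# Snapshot it during the read-only window.
aws ec2 create-snapshot --region us-east-1 \
  --volume-id "$VOLUME_ID" \
  --description "gerrit-blue primary1 data volume"
```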
| |
| 2. Update the `setup.env` to point to existing volumes, for example: |
| |
| ```bash |
| PRIMARY_FILESYSTEM_ID=fs-93514727 |
| REPLICA_FILESYSTEM_ID=fs-9c514728 |
| GERRIT_VOLUME_SNAPSHOT_ID=snap-048a5c2dfc14a81eb |
| ``` |
| |
If the network stack was created as part of this deployment (i.e. a new VPC
was created), then you also need to set the network resources so that the green
stack can be deployed in the same VPC, for example:
| |
| ```bash |
VPC_ID=vpc-03292278512e783c7
| INTERNET_GATEWAY_ID=igw-0cb5b144c294f9411 |
| |
| SUBNET1_ID=subnet-066065ea55fda52cf |
| SUBNET1_AZ=us-east-1a |
| SUBNET1_CIDR=10.0.0.0/24 |
| |
| SUBNET2_ID=subnet-0fefe45d89ce02b31 |
| SUBNET2_AZ=us-east-1b |
| SUBNET2_CIDR=10.0.32.0/24 |
| ``` |
| |
Note that if the refs-db DynamoDB tables were created as part of the initial
stack (`CREATE_REFS_DB_TABLES` was set to `true`), you will need to explicitly
set it to `false` to avoid attempting to create the same tables again.
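
For example, in the green stack's `setup.env`:

```bash
# The refs-db tables already exist from the blue stack: do not recreate them.
CREATE_REFS_DB_TABLES=false
```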
| |
| 3. Deploy the *green* stack: |
| |
| ```bash |
| make AWS_REGION=us-east-1 AWS_PREFIX=gerrit-green create-all |
| ``` |
| |
4. Once the green stack comes up, Gerrit will start reindexing the changes
that were created between the time the EBS snapshot was taken and now.
This happens in the background and might take some time, depending on how old
the snapshot is.

Once you are happy that the green stack is aligned and healthy, you can switch
the Route53 DNS to the new green stack.
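
A hedged sketch of such a DNS switch, assuming you expose a stable alias to
your users (the hosted zone id and the hostnames below are placeholders):

```bash
# Re-point the stable alias to the green primaries endpoint.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "gerrit.mycompany.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "gerrit-green-http-primaries.mycompany.com"}]
      }
    }]
  }'
```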
| |
5. You can leave the blue stack running as long as you want, so that you can
always roll back to it. Once ready, you can delete the blue stack as follows:
| |
| ```bash |
| make AWS_REGION=us-east-1 AWS_PREFIX=gerrit-blue delete-all |
| ``` |
| |
Note that, even if the EFS resources were created as part of the blue stack,
they will be retained during the stack deletion, so that they can still be used
by the green stack. This applies to the EFS filesystems as well as to the VPC
resources (if they were created as part of the blue stack).
| |
| ### Logging |
| |
| * All the logs are forwarded to AWS CloudWatch in the LogGroup with the cluster |
| stack name. Please refer to the general [logging documentation](../README.md#logging) |
| for further information on logging. |
| |
| ### Monitoring |
| |
| * Standard CloudWatch monitoring metrics for each component |
| * Application level CloudWatch monitoring can be enabled as described [here](../Configuration.md#cloudwatch-monitoring) |
* A Prometheus and Grafana stack is currently not available for dual-primary, but a change is in progress to enable it
(see [Issue 12979](https://bugs.chromium.org/p/gerrit/issues/detail?id=12979))
| |
| ## How to run it |
| |
| ### 0 - Prerequisites |
| |
Follow the steps described in the [Prerequisites](../Prerequisites.md) section.
| |
| ### 1 - Configuration |
| |
| Please refer to the [configuration docs](../Configuration.md) to understand how to set up the |
| configuration and what common configuration values are needed. |
On top of that, you might need to set additional parameters specific to this recipe.
| |
| #### Environment |
| |
| Configuration values affecting deployment environment and cluster properties |
| |
* `EVENTSBROKER_LIB_VER`: Mandatory. Use a [version which is compatible](https://repo1.maven.org/maven2/com/gerritforge/) with your Gerrit version. Using incompatible versions will prevent the service from starting up.
* `GLOBALREFDB_LIB_VER`: Mandatory. Use a [version which is compatible](https://repo1.maven.org/maven2/com/gerritforge/) with your Gerrit version. Using incompatible versions will prevent the service from starting up.
| * `SERVICE_PRIMARY1_STACK_NAME`: Optional. Name of the primary 1 service stack. `gerrit-service-primary-1` by default. |
| * `SERVICE_PRIMARY2_STACK_NAME`: Optional. Name of the primary 2 service stack. `gerrit-service-primary-2` by default. |
| * `DASHBOARD_STACK_NAME` : Optional. Name of the dashboard stack. `gerrit-dashboard` by default. |
| * `HTTP_PRIMARY1_SUBDOMAIN`: Optional. Name of the primary 1 sub domain for HTTP traffic. `gerrit-http-primary-1-demo` by default. |
| * `SSH_PRIMARY1_SUBDOMAIN`: Optional. Name of the primary 1 sub domain for SSH traffic. `gerrit-ssh-primary-1-demo` by default. |
| * `HTTP_PRIMARY2_SUBDOMAIN`: Optional. Name of the primary 2 sub domain for HTTP traffic. `gerrit-http-primary-2-demo` by default. |
| * `SSH_PRIMARY2_SUBDOMAIN`: Optional. Name of the primary 2 sub domain for SSH traffic. `gerrit-ssh-primary-2-demo` by default. |
| * `HTTP_REPLICA_SUBDOMAIN`: Mandatory. The subdomain of the Gerrit replica for HTTP traffic. For example: `<AWS_PREFIX>-http-replica` |
| * `SSH_REPLICA_SUBDOMAIN`: Mandatory. The subdomain of the Gerrit replica for SSH traffic. For example: `<AWS_PREFIX>-ssh-replica` |
* `HTTP_PRIMARIES_GERRIT_SUBDOMAIN`: Mandatory. The subdomain of the load balancer serving HTTP traffic to both primary Gerrit instances.
For example: `<AWS_PREFIX>-http-primaries`
* `SSH_PRIMARIES_GERRIT_SUBDOMAIN`: Mandatory. The subdomain of the load balancer serving SSH traffic to both primary Gerrit instances.
For example: `<AWS_PREFIX>-ssh-primaries`
| * `PRIMARY_FILESYSTEM_THROUGHPUT_MODE`: Optional. The throughput mode for the primary file system to be created. |
| default: `bursting`. More info [here](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-efs-filesystem.html) |
| * `PRIMARY_FILESYSTEM_PROVISIONED_THROUGHPUT_IN_MIBPS`: Optional. Only used when `PRIMARY_FILESYSTEM_THROUGHPUT_MODE` is set to `provisioned`. |
| default: `256`. |
| |
* `GERRIT_PRIMARY_INSTANCE_ID`: Optional. Identifier shared by all Gerrit primary instances.
| "gerrit-dual-primary-PRIMARY" by default. |
| * `GERRIT_REPLICA_INSTANCE_ID`: Optional. Identifier for the Gerrit replica instance. |
| "gerrit-dual-primary-REPLICA" by default. |
| |
| * `PRIMARY_MAX_COUNT`: Optional. Maximum number of EC2 instances in the primary autoscaling group. |
| "2" by default. Minimum: "2". |
| |
| * `GERRIT_VOLUME_SNAPSHOT_ID` : Optional. Id of the EBS volume snapshot used to |
| create new EBS volume for Gerrit data. A new volume will be created for each |
| primary, based on this snapshot. |
| |
Note that, unlike other recipes, dual-primary does not support the
`GERRIT_VOLUME_ID` parameter, since it wouldn't be possible to mount the same
EBS volume on multiple EC2 instances.
| |
| * `GERRIT_VOLUME_SIZE_IN_GIB`: Optional. The size of the Gerrit data volume, in GiBs. `10` by default. |
| * `FILESYSTEM_ID`: Optional. An existing EFS filesystem id. |
| |
| If empty, a new EFS will be created to store git data. |
Setting this value is required when deploying a dual-primary cluster using
existing data, as well as when performing blue/green deployments.
| The nested stack will be *retained* when the cluster is deleted, so that |
| existing data can be used to perform blue/green deployments. |
| |
* `AUTOREINDEX_POLL_INTERVAL`. Optional. Interval between reindexing of all changes, accounts and groups.
Default: `10m`
See the high-availability docs [here](https://gerrit.googlesource.com/plugins/high-availability/+/refs/heads/master/src/main/resources/Documentation/config.md)
| |
* `DYNAMODB_LOCKS_TABLE_NAME`. Optional. The name of the DynamoDB table used for
distributed locking.
| Default: `locksTable` |
| See DynamoDB lock client [here](https://github.com/awslabs/amazon-dynamodb-lock-client) |
| |
| * `DYNAMODB_REFS_TABLE_NAME`. Optional. The name of the dynamoDB table used to |
| store git refs and their associated sha1. |
| Default: `refsDb` |
| |
| * `CREATE_REFS_DB_TABLES`. Optional. Whether to create the DynamoDB refs and |
| lock tables. |
| Default: `false` |
| |
| ##### Shared filesystem for replicas |
| |
Similarly to primary nodes, replicas share data via an EFS filesystem which is
mounted under the `/var/gerrit/git` directory. This allows git data to persist
beyond the lifespan of a single instance and to be shared, so that replicas can
scale down and up according to needs.
| |
| * `REPLICA_FILESYSTEM_ID`: Optional. An existing EFS filesystem id to mount on replicas. |
| |
| If empty, a new EFS will be created to store git data. |
Setting this value is required when deploying a dual-primary cluster using
existing data, as well as when performing blue/green deployments.
| The nested stack will be *retained* when the cluster is deleted, so that |
| existing data can be used to perform blue/green deployments. |
| |
| * `REPLICA_FILESYSTEM_THROUGHPUT_MODE`: Optional. The throughput mode for the file system to be created. |
| default: `bursting`. More info [here](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-efs-filesystem.html) |
| |
| * `REPLICA_FILESYSTEM_PROVISIONED_THROUGHPUT_IN_MIBPS`: Optional. Only used when `REPLICA_FILESYSTEM_THROUGHPUT_MODE` is set to `provisioned`. |
| default: `256`. |
| |
| ##### Auto Scaling of replicas instances |
| |
Gerrit replicas can scale in or out automatically to accommodate increases or
decreases in traffic, which typically comes from build or test jobs executed by
some sort of automated build pipeline.
| |
| Since they all [share the same git data over EFS](#shared-filesystem-for-replicas), |
| replicas are immediately ready to serve traffic as soon as they come up and |
register behind the load balancer.
| |
There is a one-to-one relationship between replica tasks and EC2 instances: each
EC2 instance in the replica ASG runs one and only one replica task.
Because of this, the capacity values for replicas (minimum, desired and
maximum) configure both the number of tasks and the capacity of the ASG, since
they always need to be in sync.
| |
The scaling policy adds or removes capacity as required to keep the average CPU
usage (of the replica service) close to the specified target value.
| |
Tasks in the provisioning state that cannot find sufficient resources on
| the existing instances will automatically trigger the capacity provider to scale |
| out the replica ASG. As more EC2 instances become available, tasks in the |
| provisioning state will get placed onto those instances, reducing the number of |
| tasks in provisioning. |
| |
| Conversely, as the average CPU usage (of the replica service) drops under the |
| specified target value, and replica tasks get removed, the capacity provider |
| will reduce the number of EC2 instances too. |
| |
| Note that only EC2 instances that are not running any replica task will scale in. |
| |
| These are the available settings: |
| |
| * `REPLICA_AUTOSCALING_MIN_CAPACITY` Optional. The minimum number of tasks that |
| replicas should scale in to. This is also the minimum number of EC2 instances in |
| the replica ASG |
| default: *1* |
| |
| * `REPLICA_AUTOSCALING_DESIRED_CAPACITY` Optional. The desired number of |
| replica tasks to run. This is also the desired number of EC2 instances in the |
| replica ASG. |
| default: *1* |
| |
| * `REPLICA_AUTOSCALING_MAX_CAPACITY` Optional. The maximum number of tasks that |
| replicas should scale out to. This is also the maximum number of EC2 instances |
| in the replica ASG |
| default: *2* |
| |
| * `REPLICA_AUTOSCALING_SCALE_IN_COOLDOWN` Optional. The amount of time, in |
| seconds, after a scale-in activity completes before another scale-in activity |
| can start |
| default: *300* seconds |
| |
| * `REPLICA_AUTOSCALING_SCALE_OUT_COOLDOWN` Optional. The amount of time, in |
| seconds, to wait for a previous scale-out activity to take effect |
| default: *300* seconds |
| |
| * `REPLICA_AUTOSCALING_TARGET_CPU_PERCENTAGE` Optional. Aggregate CPU |
| utilization target for auto-scaling. Auto-scaling will add or remove tasks in |
| the replica service to be as close as possible to this value |
| |
| * `REPLICA_CAPACITY_PROVIDER_TARGET` Optional. The target capacity value for the |
| capacity provider of replicas (must be > 0 and <= 100). |
| default: *100* |
| |
Setting this value to 100 means that there will be no _spare capacity_
allocated in the replica ASG:

If 3 replica tasks are needed, then the ASG will adjust to exactly 3 EC2 instances.

Setting this value to less than 100 enables spare capacity in the ASG. For
example, if you set this value to 50, the scaling policy will adjust the number
of EC2 instances until it is exactly twice the number needed to run all of the
tasks:

If 3 replica tasks are needed, then the ASG will adjust to 6 EC2 instances.
| |
| * `REPLICA_CAPACITY_PROVIDER_MIN_STEP_SIZE` Optional. The minimum number of EC2 |
| instances for replicas that will scale in or scale out at one time (must be >= 1 |
| and <= 10) |
| default: *1* |
| |
| * `REPLICA_CAPACITY_PROVIDER_MAX_STEP_SIZE` Optional. The maximum number of EC2 |
| instances for replicas that will scale in or scale out at one time (must be >= 1 |
| and <= 10) |
| default: *1* |
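
Putting these together, a hypothetical `setup.env` fragment for a pool of two
to six replicas with no spare EC2 capacity (the values are example choices,
not recommendations):

```bash
REPLICA_AUTOSCALING_MIN_CAPACITY=2
REPLICA_AUTOSCALING_DESIRED_CAPACITY=2
REPLICA_AUTOSCALING_MAX_CAPACITY=6
# Add or remove tasks to keep the service's average CPU around 70%.
REPLICA_AUTOSCALING_TARGET_CPU_PERCENTAGE=70
REPLICA_AUTOSCALING_SCALE_IN_COOLDOWN=300
REPLICA_AUTOSCALING_SCALE_OUT_COOLDOWN=300
# One EC2 instance per replica task, no spare capacity.
REPLICA_CAPACITY_PROVIDER_TARGET=100
```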
| |
| #### REPLICATION SERVICE |
| |
| * `REPLICATION_SERVICE_ENABLED`: Optional. Whether to expose a replication endpoint. |
| "false" by default. |
| * `SERVICE_REPLICATION_STACK_NAME`: Optional. The name of the replication service stack. |
| "git-replication-service" by default. |
* `SERVICE_REPLICATION_DESIRED_COUNT`: Optional. The desired number of replication tasks.
| "1" by default. |
| * `GIT_REPLICATION_SUBDOMAIN`: Optional. The subdomain to use for the replication endpoint. |
| "git-replication" by default. |
| |
It is also possible to replicate *to* an extra target by providing a FQDN.
The target is expected to expose port 9148 and port 1022 for git and git admin
operations respectively.
| |
| * `REMOTE_REPLICATION_TARGET_HOST`: Optional. The fully qualified domain name of a remote replication target. |
| Empty by default. |
| |
| The replication service and the remote replication target represent the reading |
| and writing sides of Git replication: by enabling both of them, it is possible to |
| establish replication to a remote Git site. |
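
For example, a sketch of the relevant `setup.env` entries enabling the local
replication endpoint and pointing to a remote target (the FQDN is a placeholder):

```bash
REPLICATION_SERVICE_ENABLED=true
GIT_REPLICATION_SUBDOMAIN=git-replication
# Remote site expected to listen on 9148 (git) and 1022 (git admin).
REMOTE_REPLICATION_TARGET_HOST=gerrit.remote-site.example.com
```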
| |
| #### MULTI-SITE |
| |
This recipe supports multi-site. Multi-site is a specific configuration of Gerrit
that allows it to be part of a distributed multi-primary deployment across
multiple Gerrit clusters. No storage is shared among the Gerrit sites: syncing
happens through two channels:
| |
* The `replication` plugin allows alignment of git data (see the
[replication service](#replication-service) section for how to enable this).
* The `multi-site` group of plugins and resources allows the coordination and the exchange
of Gerrit-specific events that are produced and consumed by the members of the multi-site deployment
(see the [multi-site design](https://gerrit.googlesource.com/plugins/multi-site/+/refs/heads/stable-3.2/DESIGN.md)
for more information on this).
| |
| ##### Requirements |
* Kafka brokers and DynamoDB are required by this recipe: they are expected to
already exist and to be accessible, with server-side TLS security enabled, by
the primary instances resulting from the deployment of this recipe.
* The replication service must be enabled to allow syncing of Git data.
| |
| These are the parameters that can be specified to enable/disable multi-site: |
| |
| * `MULTISITE_ENABLED`: Optional. Whether this Gerrit is part of a multi-site |
| cluster deployment. "false" by default. |
| * `MULTISITE_KAFKA_BROKERS`: Required when "MULTISITE_ENABLED=true". |
| Comma separated list of Kafka broker hosts (host:port) |
| to use for publishing events to the message broker. |
| * `MULTISITE_GLOBAL_PROJECTS`: Optional. Comma separated list of patterns (see [projects.pattern](https://gerrit.googlesource.com/plugins/multi-site/+/refs/heads/stable-3.2/src/main/resources/Documentation/config.md)) |
to specify which projects are available across all sites. This parameter applies to both multi-site
and replication service remote destinations.
Empty by default, which means that all projects are available across all sites.
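
For example, a hypothetical `setup.env` fragment enabling multi-site (broker
hosts and project patterns are placeholders):

```bash
MULTISITE_ENABLED=true
# Comma separated Kafka broker host:port pairs, TLS-enabled.
MULTISITE_KAFKA_BROKERS=broker-1.example.com:9093,broker-2.example.com:9093
# Optional: limit which projects are global across sites.
MULTISITE_GLOBAL_PROJECTS=^team-a/.*,shared-tools
# Git data syncing requires the replication service.
REPLICATION_SERVICE_ENABLED=true
```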
| |
| ### 2 - Deploy |
| |
| * Create the cluster, services and DNS routing stacks: |
| |
| ``` |
| make [AWS_REGION=a-valid-aws-region] [AWS_PREFIX=some-cluster-prefix] create-all |
| ``` |
| |
The optional `AWS_REGION` and `AWS_PREFIX` allow you to define where the cluster will be deployed and how it will be named.
| |
| It might take several minutes to build the stack. |
You can monitor the creation of the stacks in [CloudFormation](https://console.aws.amazon.com/cloudformation/home)
| |
* *NOTE*: the creation of the cluster requires an EC2 key pair, which is useful when you need to connect
to the EC2 instances for troubleshooting purposes. The key pair is automatically generated
and stored in a `pem` file in the current directory.
Use it when ssh-ing into your instances as follows: `ssh -i cluster-keys.pem ec2-user@<ec2_instance_ip>`
| |
| #### Replication-Service |
| |
Optionally this recipe can be deployed so that replication onto the shared EFS volume
is available.
| |
By setting the environment variable `REPLICATION_SERVICE_ENABLED=true`, this recipe will
set up and configure additional resources that allow other sites to replicate
to a specific endpoint, exposed as:
| |
* For Git replication
`$(GIT_REPLICATION_SUBDOMAIN).$(HOSTED_ZONE_NAME):9148`

* For Git admin replication
`$(GIT_REPLICATION_SUBDOMAIN).$(HOSTED_ZONE_NAME):1022`
| |
The service will persist git data on the same EFS volume mounted by the Gerrit
primary1 and primary2 instances.
| |
Note that the replication endpoint is not internet-facing, thus replication requests
must come from a peered VPC.
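
As an illustration, a hypothetical `replication.config` fragment on a source
Gerrit site in a peered VPC, replicating to this endpoint (the hostname is a
placeholder and the remote-side URL layout depends on your setup):

```
[remote "aws-dual-primary"]
  url = git://git-replication.mycompany.com:9148/${name}.git
  adminUrl = ssh://gerrit@git-replication.mycompany.com:1022/var/gerrit/git/${name}.git
  mirror = true
```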
| |
| ### Cleaning up |
| |
| ``` |
| make [AWS_REGION=a-valid-aws-region] [AWS_PREFIX=some-cluster-prefix] delete-all |
| ``` |
| |
The optional `AWS_REGION` and `AWS_PREFIX` allow you to specify exactly which stack you are targeting for deletion.
| |
| Note that this will *not* delete: |
| * Secrets stored in Secret Manager |
| * SSL certificates |
| * ECR repositories |
| * EFS stack |
| * VPC and subnets (if created as part of this deployment, rather than externally |
| provided) |
* Refs-DB DynamoDB stack (if created as part of this deployment, rather than
externally provided)
| |
Note that you can completely delete the stack, including explicitly retained
resources such as the EFS Git filesystem, the VPC and subnets, and the DynamoDB
stack, by issuing the more aggressive command:
| |
| ``` |
| make [AWS_REGION=a-valid-aws-region] [AWS_PREFIX=some-cluster-prefix] delete-all-including-retained-stack |
| ``` |
| |
Note that this will prompt you to confirm your choice:
| |
| ``` |
| * * * * WARNING * * * * this is going to completely destroy the stack, including git data. |
| |
| Are you sure you want to continue? [y/N] |
| ``` |
| |
If you want to automate this programmatically, you can just pipe the `yes`
command into `make`:
| |
| ``` |
| yes | make [AWS_REGION=a-valid-aws-region] [AWS_PREFIX=some-cluster-prefix] delete-all-including-retained-stack |
| ``` |
| |
| ### Access your Gerrit instances |
| |
| Get the URL of your Gerrit primary instances this way: |
| |
| ``` |
| aws cloudformation describe-stacks \ |
| --stack-name <SERVICE_PRIMARY1_STACK_NAME> \ |
| | grep -A1 '"OutputKey": "CanonicalWebUrl"' \ |
| | grep OutputValue \ |
| | cut -d'"' -f 4 |
| |
| aws cloudformation describe-stacks \ |
| --stack-name <SERVICE_PRIMARY2_STACK_NAME> \ |
| | grep -A1 '"OutputKey": "CanonicalWebUrl"' \ |
| | grep OutputValue \ |
| | cut -d'"' -f 4 |
| ``` |
| |
| Gerrit primary instance ports: |
| * HTTP `8080` |
| * SSH `29418` |
| |
| ### External Services |
| |
If you need to set up some external services (maybe for testing purposes, such as SMTP or LDAP),
you can follow the instructions [here](../README.md#external-services)
| |
| ### Docker |
| |
Refer to the [Docker](../Docker.md) section for information on how to set up Docker or how to publish images.