---
title: "Scaling Dimensions"
permalink: scaling.html
hide_sidebar: true
hide_navtoggle: true
toc: false
---
## Scaling Dimensions
As you scale, you will inevitably run into issues. Depending on your specific
setup, those issues may be quite different from what other sites run into. Some
of the specific dimensions along which installations may or may not need to
scale are:
* Number of Changes: Several sites have close to 1M changes.
* Number of Projects: Some Gerrit installs have 20K projects; many have at
  least a few thousand.
* Number of Refs: Several sites have projects with over 100K refs; at least
  one has close to 500K refs.
* Number of Users: Many installs serve 1K users; some serve at least 3K.
* Number of Groups: Some Gerrit installs use 6K groups (most issues have been
  worked out here).
* Single Repository Sizes: Some Gerrit projects are 40G even when aggressively
  packed; repositories this size can cause issues such as email templates
  taking lots of CPU.
* Total Repository Sizes: Gerrit can handle at least 1TB of repository data
  easily.
* Large Files: Gerrit may have difficulty with some very large files (what
  size can it handle easily?)
* Number of Slaves: Several installations have over 10 slaves per master; at
  least one has 25.
* Number of Continents with Slaves: Slaves for a single master have been
  distributed across at least 3 continents successfully.
* Number of Receive-Packs: Some Gerrit masters are handling 20k receive-packs
  (pushes) per day.
--------------------------------------------------------------------------------
## Servers
### Master
The first step to scaling is to scale your master server. Some easy, but pricey
ways to scale your master are:
* Adding cores; some of the larger installations use 48-core machines
* Adding RAM; most of the larger installations have over 100GB, at least one
  has 1TB
* Ensuring fast disk IO; SSDs help prevent serious performance degradation
  when repos are not well repacked (seeks can be crippling here)
* Upgrading the network; I suspect that most large installs use 10Gb Ethernet
### Mirrors/Slaves
Once you have a decent master, it is probably worth adding either some git
mirrors (if you do not need ACLs on your repos) or Gerrit slaves to help
offload many of your read-only requests. Mirrors/Slaves can also help reduce
LAN and WAN traffic if you place them nearer to your users/build hosts. This can be
particularly useful for remote sites. Some of the larger installations have at
least 25 of these.
#### Shared Storage and Replication Entries For Slaves
A common practice is to use site-local shared storage (NFS...) on remote slaves
when there is more than one slave at the remote site. One major advantage of
this is that it reduces the data required to be pushed during replication to
that site. This requires consolidating the replication events to those slaves in
order to avoid having duplicated pushes to the same storage. This consolidation
means that the master replication file will only have one entry for each set of
slaves on the same storage. While a single slave could be set up as the sole
replication receiver, added availability and scaling is being reliably achieved
by using a load balancer on the receiving end to distribute each incoming push
to a different slave (since the back-end storage is the same, they all will
still see every update).
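As a rough sketch, the consolidated entry in the master's replication.config
might look like this (the remote name, hostname, and paths are hypothetical;
only the one-entry-per-storage-group idea matters):
```
# replication.config on the master -- a single entry for the whole
# shared-storage slave group; the load balancer fans each push out to a slave.
[remote "site-a"]
  url = gerrit-replica-lb.site-a.example.com:/var/gerrit/git/${name}.git
  mirror = true
  threads = 4
```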
#### DB Slaves
DB slaves are being used on remote sites so that remote slaves do not have to
traverse the WAN to talk to the master DB. Both PostgreSQL and MySQL are being
used successfully for this. This can be particularly helpful in reducing WAN
traffic related to high ref counts when doing repo syncs (the
[change cache](https://gerrit-review.googlesource.com/#/c/35220) was also
designed to help with this.)
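A hedged example of pointing a remote slave at a site-local read replica in its
gerrit.config (the hostnames and names below are made up):
```
# gerrit.config on a remote slave -- use the site-local DB replica instead of
# crossing the WAN to the master DB.
[database]
  type = postgresql
  hostname = reviewdb-replica.site-a.example.com
  database = reviewdb
  username = gerrit
```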
### Multi-Master
The Gerrit MultiMaster plug-in describes how to set up a single-site
multi-master configuration with a shared storage back-end for git repository
data. However, there are currently no known sites using the open-source
multi-master technology yet. The Google-hosted gerrit-review site currently
runs in multi-site multi-master mode, but it relies on proprietary technology
to do so.
--------------------------------------------------------------------------------
## Jetty
The default built-in web container which Gerrit uses is Jetty. Some
installations have had serious "Failed to dispatch" errors which lead to 100%
CPU and filled-up logs, requiring a server reboot to recover. This can be
triggered by long-running RPCs building up and causing the HTTP queue to be
used. One way to work around this issue is to set httpd.maxQueued = 0.
Alternatively, you can use
[Tomcat](https://gerrit-review.googlesource.com/#/c/35010/6/Documentation/install-tomcat.txt)
instead of Jetty.
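A minimal sketch of that workaround in gerrit.config (adjust for your own
site):
```
# gerrit.config -- the httpd.maxQueued workaround described above.
[httpd]
  maxQueued = 0
```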
--------------------------------------------------------------------------------
## Repo Syncs
With beefier servers, many people have
[seen](http://groups.google.com/group/repo-discuss/browse_thread/thread/c8f003f2247d7157/ad6915f5558df8f5?lnk=gst&q=repo+sync+error#ad6915f5558df8f5)
channel master issues with ssh. Setting GIT_SSH will cause repo to avoid using
channel master, and thus avoid triggering these errors:
```
export GIT_SSH=$(which ssh)
```
Is this related to resetting the key after a certain amount of data?
--------------------------------------------------------------------------------
## Java HEAP and GC
Operations on git repositories can consume lots of memory. If you consume more
memory than your java heap, your server may either run out of memory and fail,
or simply thrash forever while java garbage collects. Large fetches such as
clones tend to be the largest RAM consumers on a Gerrit system. Since the total
potential memory load is generally proportional to the total number of SSH
threads and replication threads combined, it is a good idea to configure your
heap size and thread counts together to form a safe combination. One way to do
that is to first determine your maximum memory usage per thread. Once you have
determined the per-thread usage, you can tune your server so that your total
thread count multiplied by your maximum memory usage per thread does not exceed
your heap size.
One way to figure out your maximum memory usage per thread is to find your
maximum git clone tipping point. Your tipping point is the maximum number of
clones that you can perform in parallel without causing your server to run out
of memory. To do this, you must first tune your server so that your ssh threads
are set to a higher than normally safe value, at least as high as the number of
parallel clones you are going to attempt. When ready, test with higher and
higher clone counts until the server tips over, then back off to the point
right before it tips. It helps to use your "worst" repository for this. Once
you have found the tipping point, you can calculate the approximate per-thread
memory usage by dividing your heap size by your clone count. If you find that
you still see heavy java GC, you may want to reduce your thread counts further.
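As a purely illustrative example (all numbers here are made up): if a 32GB heap
tips over at around 80 parallel clones of your worst repository, budget roughly
400MB per thread and keep the combined ssh and replication thread counts safely
below that ceiling, e.g.:
```
# gerrit.config -- hypothetical sizing; derive your own numbers from your
# measured tipping point.
[container]
  heapLimit = 32g
[sshd]
  # ~64 ssh threads * ~400MB worst case leaves headroom under the 32g heap
  # once replication threads are counted as well.
  threads = 64
```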
Your mileage may vary when tweaking your JVM GC parameters. You may find that
increasing the size of the young generation drastically reduces the amount of
GC thrashing your server performs.
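One place to experiment with this is container.javaOptions; the sizes below are
assumptions, not recommendations:
```
# gerrit.config -- hypothetical young-generation sizing to experiment with.
[container]
  javaOptions = -XX:NewSize=4g -XX:MaxNewSize=4g
```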
--------------------------------------------------------------------------------
## Replication
There are many scalability issues which can plague replication; most are
related to high ref counts and are not specifically covered here, so you will
likely want to be familiar with the "High Ref Counts" section first to make
replication run smoothly.
### JSch
JSch has threading issues which seem to serialize replication even across
worker groups. This has led some teams to perform replication without using ssh
(JSch is the ssh implementation used inside jgit). To do this, you may set up a
"write only" git daemon on your slaves with a port only open to your Gerrit
master and replicate via git daemon without authentication or encryption. This
is particularly useful if you have sites which replicate at very different
speeds.
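A rough sketch of such a setup (the paths, hostnames, and remote name are
assumptions; make sure the port is firewalled so only the master can reach it):
```
# On each slave: run a receive-only git daemon, reachable only from the master.
git daemon --export-all --enable=receive-pack --disable=upload-pack \
    --base-path=/var/gerrit/git --port=9418 --detach

# In the master's replication.config: push over git:// instead of ssh/JSch.
[remote "site-a-daemon"]
  url = git://slave1.site-a.example.com:9418/${name}.git
```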
### Failed To Lock
With older versions of the replication plug-in, your replication can start
running into contention and failing with "Failed to Lock" errors in your logs.
This can happen when 2 separate threads attempt to replicate the same
project/branch combination (the plug-in no longer allows this). This problem can
resurface even with the newer plug-ing if you run a MultiMaster setup since
nothing currently prevents two different masters running the replciation plug-in
for the same instance from pushing the same ref at the same time.
There are other scenarios besides replication contention that can cause "Failed
to Lock" errors. Fortunately, the current version of the replication plug-in can
be configured to retry these failed pushes.
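If your version of the plug-in supports it, retries are configured per remote
in replication.config; a hedged example (the remote name and values are
arbitrary):
```
# replication.config -- retry failed pushes instead of dropping them.
[remote "mirror"]
  url = mirror1.example.com:/var/gerrit/git/${name}.git
  replicationDelay = 15
  replicationRetry = 5
```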
--------------------------------------------------------------------------------
## High Ref Counts
High ref counts can have impacts in many places in the git/jgit/Gerrit stacks.
There are many ongoing fixes and tweaks to alleviate many of these problems,
but some of them still remain. Some can be "unofficially" worked around.
### git daemon mirrors
Versions of git prior to 1.7.11 will use an
[excessive amount of CPU](http://marc.info/?l=git&m=133310001303068&w=2) when
receiving pushes on sites with high ref counts. Upgrading git there can help
drastically reduce your replication time in these cases.
### git
Suggest to your users that they use the latest git possible; many of the older
versions (which are still the defaults on many distros) have severe problems
with high ref counts. Particularly
[bad](http://marc.info/?l=git&m=131552810309538&w=2) versions are between
1.7.4.1 and 1.7.7. Git 1.8.1 seems to have some speed-ups in fetches of high
ref counts compared to even 1.7.8.
### jgit
jgit still has a
[performance problem](http://groups.google.com/group/repo-discuss/browse_thread/thread/d0914922dc565516)
with high ref counts. Two different patches, the
[bucket queue](https://git.eclipse.org/r/#/c/24295/) patch and the
[integer priority queue](https://git.eclipse.org/r/5491) patch, have been
proposed; if applied, they drastically reduce upload and replication times in
Gerrit for repos with many (> 60K?) patch sets, making jgit extremely fast in
these cases.
### Tags
If you have Android repositories, you likely use around 400-600 of them. Cross
project tagging can be
[problematic](http://marc.info/?l=git&m=133772533609422&w=2). There are no good
solutions to this problem yet.
--------------------------------------------------------------------------------
## ACLs
On servers with little or no anonymous access, and large change counts, it can
be disastrous when non-logged-in users access a change-list page. A change-list
page scans all the changes and skips the changes a user cannot see until it has
found enough changes to display which the user can see. When there is no
anonymous access, this may mean traversing all of the changes in your instance
only to return a blank list. When the change count starts approaching 1M
changes, this large change traversal can cripple even a very high scale DB and
Gerrit combination. This is most prevalent on Monday mornings after your users
return to the office and have not logged into Gerrit yet (but are still
accessing it). One hacky way to deal with this is to patch Gerrit to never run
the ChangeList page for non-logged-in users. However, this is not a viable
solution for public sites. Of course, public sites likely do not have 1M
changes which are not visible to non-logged-in users. It may make sense to make
this a configuration option in Gerrit at some point if this cannot be sped up?
--------------------------------------------------------------------------------
## Disk Space / File Cleanup
Installations which do not have enough spare disk space for their repos can run
into problems easily. Be aware that git repos contain highly compressed data and
that at times this data may need to be uncompressed. It is easy to underestimate
the temporary needs of repositories because git is so good at compressing this
data. However, minor changes can cause repositories to "explode", so it is good
to plan ahead and leave plenty of free space so that this never becomes an
issue.
This is particularly important for those using SSDs where they might be more
likely to skimp on space.
### Git GC Repo Explosions
Under certain conditions git gc can cause a repo explosion (jgit gc does not
suffer from this problem because it puts unreachable objects in a packfile),
primarily when unreferenced objects are removed from pack files and written out
as loose objects in the file-system. Eventually git gc should prune these, but
until that happens serious problems can occur.
Some of the situations which can cause many unreferenced objects:
* A user uploads a change to the wrong repository and it gets rejected by
Gerrit
* Tags are [deleted](http://marc.info/?l=git&m=131829057610072&w=2) from the
linux repo
### Git GC
Running GC regularly is important, particularly on sites with heavy uploads.
Older versions of jgit do not have built-in gc and require using git gc.
Setting up a crontab is probably a good idea in these cases. If you run gc too
often, however, excessive pack file churn can also be a problem. A potential
[solution](https://gerrit-review.googlesource.com/#/c/35215/) for packfile
churn:
> nice -n 19 ionice -c 3 gitexproll.sh -r 5 ...

Pack file churn can lead to several issues: RAM utilization, disk utilization,
and excessive WAN utilization for file-system mirroring scripts (such as
rsync).
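A minimal sketch of a nightly gc run to invoke from cron, assuming repos live
under /var/gerrit/git (the path, schedule, and niceness are assumptions):
```
#!/bin/sh
# Hypothetical nightly-gc helper: low-priority git gc across all repositories.
find /var/gerrit/git -type d -name '*.git' -prune -print |
    xargs -I{} nice -n 19 ionice -c 3 git --git-dir={} gc --auto --quiet
```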
### Keep and Noz files
Currently, Gerrit may leave behind some temporary files in your git repos when
it shuts down (particularly if the shutdown is ungraceful). There are some
temporary files which begin with "noz"; these can consume disk space if left
uncleaned. There are also some .keep files in the objects/pack directories
which can be left behind; these do not themselves take much space, but they
will prevent git gc from repacking the packfiles they are associated with,
which can lead to poor disk space utilization and performance issues.
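A hedged way to spot these leftovers for manual review (the base path is an
assumption, and recent files may still be in use, so inspect before deleting):
```
# List week-old "noz" temp files and stray pack .keep files for inspection.
find /var/gerrit/git -name 'noz*' -mtime +7
find /var/gerrit/git -path '*/objects/pack/*.keep' -mtime +7
```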
### ~/.gerritcodereview
The temporary unjarred war files in here can build up. (This has been moved to
review_site/tmp in Gerrit 2.5+.)
--------------------------------------------------------------------------------
## Hooks
Servers with lots of RAM are susceptible to slow forks, which can delay each
hook invocation quite a bit. When java uses over 10G of memory, it may add at
least a second to each hook invocation. Using Java 7 seems to avoid this
problem and makes hooks blazingly fast again.