As you scale, you will inevitably run into issues. Depending on your specific setup, those issues may be very different from what other installations run into. Some of the dimensions on which you may or may not need to scale are:
The first step to scaling is to scale your master server. Some easy, but pricey, ways to scale your master are:
Once you have a decent master, it is probably worth adding either some git mirrors (if you do not need ACLs on your repos) or Gerrit slaves to offload much of your read-only traffic. Mirrors/slaves can also help reduce LAN and WAN traffic if you place them nearer to your users/build hosts, which can be particularly useful for remote sites. Some of the larger installations have at least 25 of these.
A common practice is to use site-local shared storage (NFS...) for remote slaves when there is more than one slave at the remote site. One major advantage is that it reduces the amount of data that has to be pushed during replication to that site. This requires consolidating the replication events to those slaves in order to avoid duplicate pushes to the same storage, which means the master's replication configuration has only one entry for each set of slaves on the same storage. While a single slave could be set up as the sole replication receiver, better availability and scaling can be achieved by putting a load balancer on the receiving end to distribute each incoming push to a different slave (since the back-end storage is the same, they all still see every update).
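As a hedged sketch, assuming three slaves at a remote site share one NFS volume and sit behind a load-balanced address `gerrit-site1-repl.example.com` (the host name, path, and thread count are hypothetical), the master's `etc/replication.config` would carry a single remote for the whole site:

    [remote "site1"]
      # One entry for the whole site; the load balancer picks a slave,
      # and all slaves see the update via the shared storage.
      url = gerrit-site1-repl.example.com:/shared/git/${name}.git
      push = +refs/heads/*:refs/heads/*
      push = +refs/changes/*:refs/changes/*
      push = +refs/tags/*:refs/tags/*
      threads = 2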
DB slaves are being used at remote sites so that remote Gerrit slaves do not have to traverse the WAN to talk to the master DB. Both PostgreSQL and MySQL are being used successfully for this. This can be particularly helpful in reducing WAN traffic related to high ref counts when doing repo syncs (the [change cache](https://gerrit-review.googlesource.com/#/c/35220) was also designed to help with this).
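A minimal sketch of a remote slave's `etc/gerrit.config` database section, assuming a PostgreSQL read replica is available at the remote site (the host, database, and user names are hypothetical):

    [database]
      type = postgresql
      hostname = db-replica-site1.example.com
      database = reviewdb
      username = gerrit

The slave itself is then started in slave (read-only) mode:

    java -jar gerrit.war daemon -d /var/gerrit/site --slave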
The Gerrit MultiMaster plug-in describes how to set up a single-site multi-master with a shared storage back-end for git repository data. However, there are currently no known sites using the open source MM technology yet. The Google-hosted gerrit-review site currently runs in multi-site multi-master mode, but it relies on proprietary technology to do so.
The default built-in web container which Gerrit uses is Jetty. Some installations have had serious “Failed to dispatch” errors which lead to 100% CPU and filled-up logs, requiring a server reboot to recover. This can be triggered by long-running RPCs backing up and causing the HTTP queue to be used. One way to work around this issue is to set httpd.maxQueued = 0. Alternatively, you can replace Jetty with [Tomcat](https://gerrit-review.googlesource.com/#/c/35010/6/Documentation/install-tomcat.txt).
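In `etc/gerrit.config` the workaround looks like:

    [httpd]
      maxQueued = 0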
With beefier servers, many people have [seen](http://groups.google.com/group/repo-discuss/browse_thread/thread/c8f003f2247d7157/ad6915f5558df8f5?lnk=gst&q=repo+sync+error#ad6915f5558df8f5) ssh control master issues. Setting GIT_SSH will cause repo to avoid using its control master, and thus avoid triggering these errors:
    export GIT_SSH=$(which ssh)
(Open question: is this related to ssh rekeying after a certain amount of data has been transferred?)
Operations on git repositories can consume lots of memory. If you consume more memory than your java heap, your server may either run out of memory and fail, or simply thrash forever while the JVM garbage collects. Large fetches such as clones tend to be the largest RAM consumers on a Gerrit system. Since the total potential memory load is generally proportional to the total number of SSH threads and replication threads combined, it is a good idea to configure your heap size and thread counts together to form a safe combination. One way to do that is to first determine your maximum memory usage per thread. Once you have determined the per-thread usage, you can tune your server so that your total thread count multiplied by your maximum memory usage per thread does not exceed your heap size.
One way to figure out your maximum memory usage per thread is to find your git clone tipping point: the maximum number of clones that you can perform in parallel without causing your server to run out of memory. To do this, first tune your server so that your ssh thread count is set to a higher-than-safe value, at least as high as the number of parallel clones you are going to attempt. When ready, test with higher and higher parallel clone counts until the server tips, then take the point right before it tips. It helps to use your “worst” repository for this. Once you have found the tipping point, you can calculate the approximate per-thread memory usage by dividing your heap size by that clone count. If you find that you still see heavy java GC activity, you may want to reduce your thread counts further.
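As a purely illustrative worked example (the numbers are hypothetical, not recommendations): if the server tips at roughly 60 parallel clones of your worst repository with a 30g heap, that is about 0.5g per thread, so the combined SSH and replication thread count should stay at or below about 60. In `etc/gerrit.config` that might look like:

    [container]
      heapLimit = 30g
    [sshd]
      # Leave headroom below the ~60-thread tipping point so that the
      # replication threads (configured in replication.config) also fit.
      threads = 48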
Your mileage may vary with tweaking your JVM GC parameters. You may find that increasing the size of the young generation drastically reduces the amount of GC thrashing your server performs.
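For example, the young generation can be sized explicitly through `container.javaOptions` in `etc/gerrit.config`; the flag values below are illustrative only, and GC logging is included so you can measure the effect:

    [container]
      # Illustrative sizes; tune against your own GC logs.
      javaOptions = -XX:NewSize=2g -XX:MaxNewSize=2g
      javaOptions = -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/gerrit/logs/gc_log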
There are many scalability issues which can plague replication. Most are related to high ref counts and are not specifically mentioned here, so you will likely need to be familiar with the “High Ref Counts” section first to make replication run smoothly.
Jsch (the ssh implementation used inside jgit) has threading issues which seem to serialize replication even across worker groups. This has led some teams to perform replication without using ssh. To do this, you may set up a “write only” git daemon on your slaves, with its port open only to your Gerrit master, and replicate via the git daemon without authentication or encryption. This is particularly useful if you have sites which replicate at very different speeds.
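A minimal sketch of this setup, assuming the slave's repositories live under /var/gerrit/git and that your firewall restricts port 9418 to the master (the paths and host names are hypothetical):

    # On the slave: serve the repos over the git protocol, accept pushes,
    # and refuse reads so the daemon is effectively write only.
    git daemon --base-path=/var/gerrit/git \
               --export-all \
               --enable=receive-pack \
               --disable=upload-pack \
               --detach

    # On the master, in etc/replication.config:
    [remote "site1-gitdaemon"]
      url = git://gerrit-slave-site1.example.com/${name}.git
      push = +refs/*:refs/*

Because the git protocol is unauthenticated and unencrypted, this only makes sense when network access to that port is tightly restricted to the master.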
With older versions of the replication plug-in, replication can start running into contention and failing with “Failed to Lock” errors in your logs. This can happen when 2 separate threads attempt to replicate the same project/branch combination (the plug-in no longer allows this). This problem can resurface even with the newer plug-in if you run a MultiMaster setup, since nothing currently prevents two different masters running the replication plug-in for the same instance from pushing the same ref at the same time.
There are other scenarios besides replication contention that can cause “Failed to Lock” errors. Fortunately, the current version of the replication plug-in can be configured to retry these failed pushes.
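A hedged sketch of the relevant knobs in `etc/replication.config`; the option names and their availability depend on your plugin version, and the values are illustrative only:

    [replication]
      # Retry pushes that failed with a lock error up to N times.
      lockErrorMaxRetries = 5
    [remote "site1"]
      # Minutes to wait before retrying a failed push to this remote.
      replicationRetry = 1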
High ref counts can have impacts in many places in the git/jgit/Gerrit stacks. There are many ongoing fixes and tweaks to alleviate many of these problems, but some of them still remain. Some can be “unofficially” worked around.
Git versions prior to 1.7.11 will use an [excessive amount of CPU](http://marc.info/?l=git&m=133310001303068&w=2) when receiving pushes on sites with high ref counts. Upgrading git there can drastically reduce your replication time in these cases.
Suggest to your users that they use the latest git possible; many of the older versions (which are still the defaults on many distros) have severe problems with high ref counts. Particularly [bad](http://marc.info/?l=git&m=131552810309538&w=2) versions are between 1.7.4.1 and 1.7.7. Git 1.8.1 seems to have some speed-ups in fetches of high ref counts compared even to 1.7.8.
jGit still has a [performance problem](http://groups.google.com/group/repo-discuss/browse_thread/thread/d0914922dc565516) with high ref counts. Two different patches, the [bucket queue](https://git.eclipse.org/r/#/c/24295/) patch and the [integer priority queue](https://git.eclipse.org/r/5491) patch, have been proposed and, if applied, will drastically reduce upload and replication times in Gerrit for repos with many (> 60K?) patch sets.
Patches such as these can make jgit extremely fast on repositories with high ref counts.
If you host the Android repositories, you likely have around 400-600 of them. Cross-project tagging can be [problematic](http://marc.info/?l=git&m=133772533609422&w=2). There are no good solutions to this problem yet.
On servers with little or no anonymous access and large change counts, it can be disastrous when non-logged-in users access a change-list page. A change-list page scans all the changes, skipping the changes a user cannot see, until it has found enough visible changes to display. When there is no anonymous access, this may mean traversing all of the changes in your instance only to return a blank list. When the change count starts approaching 1M changes, this large traversal can cripple even a very high scale DB and Gerrit combination. This is most prevalent on Monday mornings when your users return to the office and have not yet logged into Gerrit (but are still accessing it). One hacky way to deal with this is to patch Gerrit to never run the change-list query for non-logged-in users. However, this is not a viable solution for public sites; of course, public sites likely do not have 1M changes which are invisible to non-logged-in users. If this cannot be sped up, it may make sense to make this a configuration option in Gerrit at some point.
Installations which do not have enough spare disk space for their repos can easily run into problems. Be aware that git repos contain highly compressed data and that at times this data may need to be uncompressed. It is easy to underestimate the temporary needs of repositories because git is so good at compressing this data; however, minor changes can cause repositories to “explode”, so it is good to plan for this and leave enough free space that it never becomes an issue. This is particularly important for those using SSDs, where they might be more likely to skimp on space.
Under certain conditions git gc can cause a repo explosion (jgit gc does not suffer from this problem because it puts unreachable objects in a pack file), primarily when unreferenced objects are removed from pack files and written out as loose objects in the file-system. Eventually git gc should prune these, but until that happens serious problems can occur.
Some of the situations which can cause many unreferenced objects:
Running GC regularly is important, particularly on sites with heavy uploads. Older versions of jgit do not have built-in gc and require using git gc; setting up a crontab is probably a good idea in these cases (a cron sketch follows the example below). If you run gc too often, however, excessive pack file churn can also be a problem. A potential [solution](https://gerrit-review.googlesource.com/#/c/35215/) for pack file churn:
    nice -n 19 ionice -c 3 gitexproll.sh -r 5 ...
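If you rely on plain git gc instead, a nightly cron job is one way to run it regularly. A minimal sketch, with hypothetical paths and schedule:

    #!/bin/sh
    # Illustrative wrapper (paths are hypothetical); run it nightly from the
    # gerrit user's crontab, e.g.:  0 2 * * * /usr/local/bin/gerrit-gc.sh
    cd /var/gerrit/git || exit 1
    for repo in $(find . -type d -name '*.git' -prune); do
        # Low CPU/IO priority so GC does not starve interactive traffic.
        nice -n 19 ionice -c 3 git --git-dir="$repo" gc --auto --quiet
    done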
Pack file churn can lead to several issues: RAM utilization, disk utilization, and excessive WAN utilization for file-system mirroring scripts (such as rsync).
Currently, Gerrit may leave behind some temporary files in your git repos when it shuts down (particularly if the shutdown is ungraceful). Temporary files which begin with “noz” can consume disk space if left uncleaned. There can also be leftover .keep files in the objects/pack directories; these do not take much space themselves, but they prevent git gc from repacking the pack files they are associated with, which can lead to poor disk space utilization and performance issues.
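A cautious cleanup sketch, assuming the repos live under /var/gerrit/git (the path and age threshold are hypothetical), listing candidates first rather than deleting them outright:

    # List stale temporary files left behind by an ungraceful shutdown
    # (review the output before deleting anything, and only touch files
    # old enough that no in-progress operation can still be using them).
    find /var/gerrit/git -name 'noz*' -mtime +7 -print
    # List leftover .keep markers that pin otherwise repackable pack files.
    find /var/gerrit/git -path '*/objects/pack/*.keep' -mtime +7 -print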
The temporary unjarred war files in here can build up. (This has been moved to review_site/tmp in Gerrit 2.5+.)
Servers with lots of RAM are susceptible to slow forks, which can delay each hook invocation quite a bit. When java uses over 10G of memory, forking may add at least a second to each hook invocation. Using java 7 seems to avoid this problem and makes hooks blazingly fast again.