|= Gerrit Code Review - System Design
|Gerrit is a web based code review system, facilitating online code
|reviews for projects using the Git version control system.
|Gerrit makes reviews easier by showing changes in a side-by-side
|display, and allowing inline/file comments to be added by any reviewer.
|Gerrit simplifies Git based project maintainership by permitting
|any authorized user to submit changes to the master Git repository,
|rather than requiring all approved changes to be merged in by
|hand by the project maintainer. This functionality enables a more
|centralized usage of Git.
|Git is a distributed version control system, wherein each repository
|is assumed to be owned/maintained by a single user. There are no
|inherent security controls built into Git, so the ability to read
|from or write to a repository is controlled entirely by the host's
|filesystem or network access controls.
|The objective of Gerrit is to facilitate Git development by larger
|teams: it provides a means to enforce organizational policies around
|code submissions, eg. "all code must be reviewed by another
|developer", "all code shall pass tests". It achieves this by
|* providing fine-grained (per-branch, per-repository, inheriting)
|access controls, which allow a Gerrit admin to delegate permissions
|to different team(-lead)s.
|* facilitate code review: Gerrit offers a web view of pending code
|changes, that allows for easy reading and commenting by humans. The
|web view can offer data coming out of automated QA processes (eg.
|CI). The permission system also includes fine grained control of who
|can approve pending changes for submission to further facilitate
|delegation of code ownership.
|Developers create one or more changes on their local desktop system,
|then upload them for review to Gerrit using the standard `git push`
|command line program, or any GUI which can invoke `git push` on behalf
|of the user. Authentication and data transfer are handled through SSH
|and HTTPS. Uploads are protected by the authentication,
|confidentiality and integrity offered by the transport (SSH, HTTPS).
|Each Git commit created on the client desktop system is converted into
|a unique change record which can be reviewed independently.
|A summary of each newly uploaded change is automatically emailed
|to reviewers, so they receive a direct hyperlink to review the
|change on the web. Reviewer email addresses can be specified on the
|`git push` command line, but typically reviewers are added in the web
|Reviewers use the web interface to read the side-by-side or unified
|diff of a change, and insert draft inline/file comments where
|appropriate. A draft comment is visible only to the reviewer, until
|they publish those comments. Published comments are automatically
|emailed to the change author by Gerrit, and are CC'd to all other
|reviewers who have already commented on the change.
|Reviewers can score the change ("vote"), indicating whether they feel the
|change is ready for inclusion in the project, needs more work, or
|should be rejected outright. These scores provide direct feedback to
|Gerrit's change submit function.
|After a change has been scored positively by reviewers, Gerrit enables
|a submit button on the web interface. Authorized users can push the
|submit button to have the change enter the project repository. The
|user pressing the submit button does not need to be the author of the
|End-user web browsers make HTTP requests directly to Gerrit's
|HTTP server. As nearly all of the user interface is implemented
|through PolyGerrit, the majority of these requests are transmitting
|compressed JSON payloads, with all HTML being generated within the
|Gerrit's HTTP server side component is implemented as a standard Java
|servlet, and thus runs within any link:install-j2ee.html[J2EE servlet
|container]. The standard install will run inside Jetty, which is
|included in the binary.
|End-user uploads are performed over SSH or HTTP, so Gerrit's servlets
|also start up a background thread to receive SSH connections through
|an independent SSH port. SSH clients communicate directly with this
|port, bypassing the HTTP server used by browsers.
|User authentication is handled by identity realms. Gerrit supports the
|following types of authentication:
|* OpenId (see link:http://openid.net/developers/specs/[OpenID Specifications,role=external,window=_blank])
|* Google accounts (on googlesource.com)
|* 3rd party SSO
|Server side data storage for Gerrit is broken down into two different
|* Git repository data
|* Gerrit metadata
|The Git repository data is the Git object database used to store
|already submitted revisions, as well as all uploaded (proposed)
|changes. Gerrit uses the standard Git repository format, and
|therefore requires direct filesystem access to the repositories.
|All repository data is stored in the filesystem and accessed through
|the JGit library. Repository data can be stored on remote servers
|accessible through NFS or SMB, but the remote directory must
|be mounted on the Gerrit server as part of the local filesystem
|namespace. Remote filesystems are likely to perform worse than
|local ones, due to Git disk IO behavior not being optimized for
|The Gerrit metadata contains a summary of the available changes, all
|comments (published and drafts), and individual user account
|Gerrit metadata is also stored in Git, with the commits marking the
|historical state of metadata. Data is stored in the trees associated
|with the commits, typically using Git config file or JSON as the base
|format. For metadata, there are 3 types of data: changes, accounts and
|Accounts are stored in a special Git repository `All-Users`.
|Accounts can be grouped in groups. Gerrit has a built-in group system,
|but can also interface to external group system (eg. Google groups,
|LDAP). The built-in groups are stored in `All-Users`.
|Draft comments are stored in `All-Users` too.
|Permissions are stored in Git, in a branch `refs/meta/config` for the
|repository. Repository configuration (including permissions) supports
|single inheritance, with the `All-Projects` repository containing
|Code review metadata is stored in Git, alongside the code under
|review. Metadata includes change status, votes, comments. This review
|metadata is stored in NoteDb along with the submitted code and code
|under review. Hence, the review history can be exported with `git
|clone --mirror` by anyone with sufficient permissions.
|Permissions are specified on branch names, and given to groups. For
|push = group Release-Engineers
|this provides a rule, granting Release-Engineers push permission for
|There are fundamentally two types of permissions:
|* Write permissions (who can vote, push, submit etc.)
|* Read permissions (who can see data)
|Read permissions need special treatment across Gerrit, because Gerrit
|should only surface data (including repository existence) if a user
|has read permission. This means that
|* The git wire protocol support must omit references from
|advertisement if the user lacks read permissions
|* Uploads through the git wire protocol must refuse commits that are
|based on SHA-1s for data that the user can't see.
|* Tags are only visible if their commits are visible to user through a
|Metadata (eg. OAuth credentials) is also stored in Git. Existing
|endpoints must refuse creating branches or changes that expose these
|metadata or allow changes to them.
|Almost all data is stored as Git, but Git only supports fast lookup by
|SHA-1 or by ref (branch) name. Therefore Gerrit also has an indexing
|system (powered by Lucene by default) for other types of queries.
|There are 4 indices:
|* Project index - find repositories by name, parent project, etc.
|* Account index - find accounts by name, email, etc.
|* Group index - find groups by name, owner, description etc.
|* Change index - find changes by file, status, modification date etc.
|The base entities are characterized by SHA-1s. Storing the
|characterizing SHA-1s allows detection of stale index entries.
|== Plug-in architecture
|Gerrit has a plug-in architecture. Plugins can be installed by
|dropping them into $site_directory/plugins, or at runtime through
|plugin SSH commands, or the plugin REST API.
|=== Backend plugins
|At runtime, code can be loaded from a `.jar` file. This code can hook
|into predefined extension points. A common use of plugins is to have
|Gerrit interoperate with site-specific tools, such as CI-systems or
|// list some notable extension points, and notable plugins
|// link to plugin development
|Some backend plugins expose the JVM for scripting use (eg. Groovy,
|Scala), so plugins can be written without having to setup a Java
|// Luca to expand: how do script plugins load their scripts?
|=== Frontend plugins
|The UI can be extended using Frontend plugins. This is useful for
|changing the look & feel of Gerrit, but it can also be used to surface
|data from systems that aren't integrated with the Gerrit backend, eg.
|CI systems or code coverage providers.
|// FE team to write a bit more:
|// * how to load ?
|// * XSRF, CORS ?
|== Internationalization and Localization
|As a source code review system for open source projects, where the
|commonly preferred language for communication is typically English,
|Gerrit does not make internationalization or localization a priority.
|The majority of Gerrit's users will be writing change descriptions
|and comments in English, and therefore an English user interface
|is usable by the target user base.
|== Accessibility Considerations
|// UI team to rewrite this.
|Whenever possible Gerrit displays raw text rather than image icons,
|so screen readers should still be able to provide useful information
|to blind persons accessing Gerrit sites.
|Standard HTML hyperlinks are used rather than HTML div or span tags
|with click listeners. This provides two benefits to the end-user.
|The first benefit is that screen readers are optimized to locating
|standard hyperlink anchors and presenting them to the end-user as
|a navigation action. The second benefit is that users can use
|the 'open in new tab/window' feature of their browser whenever
|When possible, Gerrit uses the ARIA properties on DOM widgets to
|provide hints to screen readers.
|== Browser Compatibility
|// UI team to add section on minimum browser requirements.
|no server side rendering fallbacks, the browser must support modern
|Dumb clients such as `lynx`, `wget`, `curl`, or even many search engine
|spiders are not able to access Gerrit content.
|All of the content stored within Gerrit is also available through
|other means, such as gitweb or the `git://` protocol. Any existing
|search engine crawlers can index the server-side HTML served by a code
|browser, and thus can index the majority of the changes which might
|appear in Gerrit. Therefore the lack of support for most search engine
|crawlers is a non-issue for most Gerrit deployments.
|== Product Integration
|Gerrit optionally surfaces links to HTML pages in a code browser. The
|links are configurable, and Gerrit comes with a built-in code browser,
|Gerrit integrates with some types of corporate single-sign-on (SSO)
|solutions, typically by having the SSO authentication be performed
|in a reverse proxy web server and then blindly trusting that all
|incoming connections have been authenticated by that reverse proxy.
|When configured to use this form of authentication, Gerrit does
|not integrate with OpenID providers.
|When installing Gerrit, administrators may optionally include an
|HTML header or footer snippet which may include user tracking code,
|such as that used by Google Analytics. This is a per-instance
|configuration that must be done by hand, and is not supported
|out of the box. Other site trackers instead of Google Analytics
|Gerrit does not integrate with any Google service, or any other
|services other than those listed above.
|Plugins (see above) can be used to drive product integrations from the
|Gerrit side. Products that support Gerrit explicitly can use the REST
|API or the SSH API to contact Gerrit.
|== Privacy Considerations
|Gerrit stores the following information per user account:
|* Full Name
|* Preferred Email Address
|The full name and preferred email address fields are shown to any
|site visitor viewing a page containing a change uploaded by the
|account owner, or containing a published comment written by the
|Showing the full name and preferred email is approximately the same
|risk as the `From` header of an email posted to a public mailing
|list that maintains archives, and Gerrit treats these fields in
|much the same way that a mailing list archive might handle them.
|Users who don't want to expose this information should either not
|participate in a Gerrit based online community, or open a new email
|address dedicated for this use.
|As the Gerrit UI data is only available through XSRF protected
|JSON-RPC calls, "screen-scraping" for email addresses is difficult,
|but not impossible. It is unlikely a spammer will go through the
|effort required to code a custom scraping application necessary
|to cull email addresses from published Gerrit comments. In most
|cases these same addresses would be more easily obtained from the
|project's mailing list archives.
|The user's name and email address is stored unencrypted in the
|== Spam and Abuse Considerations
|There is no spam protection for the Git protocol upload path.
|Uploading a change successfully requires a pre-existing account, and a
|lot of up-front effort.
|Gerrit makes no attempt to detect spam changes or comments in the web
|UI. To post and publish a comment a client must sign in and then use
|the XSRF protected JSON-RPC interface to publish the draft on an
|existing change record.
|Absence of SPAM handling is based upon the idea that Gerrit caters to
|a niche audience, and will therefore be unattractive to spammers. In
|addition, it is not a factor for corporate, on-premise deployments.
|Gerrit supports the Git wire protocol, and an API (one API for HTTP,
|and one for SSH).
|The git wire protocol does a client/server negotiation to avoid
|sending too much data. This negotation occupies a CPU, so the number
|of concurrent push/fetch operations should be capped by the number of
|Clients on slow network connections may be network bound rather than
|server side CPU bound, in which case a core may be effectively shared
|with another user. Possible core sharing due to network bottlenecks
|generally holds true for network connections running below 10 MiB/sec.
|Deployments for large, distributed companies can replicate Git data to
|read-only replicas to offload fetch traffic. The read-only replicas
|should also serve this data using Gerrit to ensure that permissions
|The API serves requests of varying costs. Requests that originate in
|the UI can block productivity, so care has been taken to optimize
|these for latency, using the following techniques:
|* Async calls: the UI becomes responsive before some UI elements
|* Caching: metadata is stored in Git, which is relatively expensive to
|access. This is sped up by multiple caches. Metadata entities are
|stored in Git, and can therefore be seen as immutable values keyed
|by SHA-1, which is very amenable to caching. All SHA-1 keyed caches
|can be persisted on local disk.
|The size (memory, disk) of these caches should be adapted to the
|instance size (number of users, size and quantity of repositories)
|for optimal performance.
|Git does not impose fundamental limits (eg. number of files per
|change) on data. To ensure stability, Gerrit configures a number of
|default limits for these.
|// add a link to the default settings.
|=== Scaling team size
|A team of size N has N^2 possible interactions. As a result, features
|that expose interactions with activities of other team members has a
|quadratic cost in aggregate. The following features scale poorly with
|large team sizes:
|* the change screen shows conflicting changes by default. This data is
|cached, but updates to pending changes cause cache misses. For a
|single change, the amount of work is proportional to the number of
|pending changes, so in aggregate, the cost of this feature is
|quadratic in the team size.
|* the change screen shows if a change is mergeable to the target
|branch. If the target branch moves quickly (large developer team),
|this causes cache misses. In aggregate, the cost of this feature is
|Both features should be turned off for repositories that involve 1000s
|=== Browser performance
|// say something about browser performance tuning.
|=== Real life numbers
|Gerrit is designed for very large projects, both open source and
|proprietary commercial projects. For a single Gerrit process, the
|following limits are known to work:
||Parameter | Maximum | Deployment
||Projects | 50,000 | gerrithub.io
||Contributors | 150,000 | eclipse.org
||Bytes/repo | 100G | Qualcomm internal
||Changes/repo | 300k | Qualcomm internal
||Revisions/Change | 300 | Qualcomm internal
||Reviewers/Change | 87 | Qualcomm internal
|// find some numbers for these stats:
|// |Files/repo | ? |
|// |Files/Change | ? |
|// |Comments/Change | ? |
|// |max QPS/CPU | ? |
|Google runs a horizontally scaled deployment. We have seen the
|following per-JVM maximums:
|.Observed maximums (googlesource.com)
||Parameter | Maximum | Deployment
||Files/repo | 500,000 | chromium-review
||Bytes/repo | 12G | chromium-review
||Changes/repo | 500k | chromium-review
||Revisions/Change | 1900 | chromium-review
||Files/Change | 10,000| android-review
||Comments/Change | 1,200 | chromium-review
|== Redundancy & Reliability
|Gerrit is structured as a single JVM process, reading and writing to a
|single file system. If there are hardware failures in the machine
|running the JVM, or the storage holding the repositories, there is no
|recourse; on failure, errors will be returned to the client.
|Deployments needing more stringent uptime guarantees can use
|replication/multi-master setup, which ensures availability and
|geographical distribution, at the cost of slower write actions.
|// TODO: link.
|Using the standard replication plugin, Gerrit can be configured
|to replicate changes made to the local Git repositories over any
|standard Git transports. After the plugin is installed, remote
|destinations can be configured in `'$site_path'/etc/replication.conf`
|to send copies of all changes over SSH to other servers, or to the
|Amazon S3 blob storage service.
|== Logging Plan
|Gerrit stores Apache style HTTPD logs, as well as ERROR/INFO messages
|from the Java logger, under `$site_dir/logs/`.
|Published comments contain a publication date, so users can judge
|when the comment was posted and decide if it was "recent" or not.
|Only the timestamp is stored in the database, the IP address of
|the comment author is not stored.
|Changes uploaded over the SSH daemon from `git push` have the
|standard Git reflog updated with the date and time that the upload
|occurred, and the Gerrit account identity of who did the upload.
|Changes submitted and merged into a branch also update the
|Git reflog. These logs are available only to the Gerrit site
|administrator, and they are not replicated through the automatic
|replication noted earlier. These logs are primarily recorded for an
|"oh s**t" moment where the administrator has to rewind data. In most
|installations they are a waste of disk space. Future versions of
|JGit may allow disabling these logs, and Gerrit may take advantage
|of that feature to stop writing these logs.
|A web server positioned in front of Gerrit (such as a reverse proxy)
|or the hosting servlet container may record access logs, and these
|logs may be mined for usage information. This is outside of the
|scope of Gerrit.
|Part of link:index.html[Gerrit Code Review]