|  | :linkattrs: | 
|  | = Gerrit Code Review - System Design | 
|  |  | 
|  | == Objective | 
|  |  | 
|  | Gerrit is a web-based code review system, facilitating online code | 
|  | reviews for projects using the Git version control system. | 
|  |  | 
|  | Gerrit makes reviews easier by showing changes in a side-by-side | 
|  | display, and allowing inline/file comments to be added by any reviewer. | 
|  |  | 
|  | Gerrit simplifies Git-based project maintainership by permitting | 
|  | any authorized user to submit changes to the master Git repository, | 
|  | rather than requiring all approved changes to be merged in by | 
|  | hand by the project maintainer.  This functionality enables a more | 
|  | centralized usage of Git. | 
|  |  | 
|  |  | 
|  | == Background | 
|  |  | 
|  | Git is a distributed version control system, wherein each repository | 
|  | is assumed to be owned/maintained by a single user.  There are no | 
|  | inherent security controls built into Git, so the ability to read | 
|  | from or write to a repository is controlled entirely by the host's | 
|  | filesystem or network access controls. | 
|  |  | 
|  | The objective of Gerrit is to facilitate Git development by larger | 
|  | teams: it provides a means to enforce organizational policies around | 
|  | code submissions, e.g. "all code must be reviewed by another | 
|  | developer", "all code shall pass tests". It achieves this by | 
|  |  | 
|  | * providing fine-grained (per-branch, per-repository, inheriting) | 
|  | access controls, which allow a Gerrit admin to delegate permissions | 
|  | to different team(-lead)s. | 
|  |  | 
|  | * facilitating code review: Gerrit offers a web view of pending code | 
|  | changes that allows for easy reading and commenting by humans. The | 
|  | web view can also show data from automated QA processes (e.g. | 
|  | CI). The permission system also includes fine-grained control of who | 
|  | can approve pending changes for submission, to further facilitate | 
|  | delegation of code ownership. | 
|  |  | 
|  | == Overview | 
|  |  | 
|  | Developers create one or more changes on their local desktop system, | 
|  | then upload them for review to Gerrit using the standard `git push` | 
|  | command line program, or any GUI which can invoke `git push` on behalf | 
|  | of the user. Authentication and data transfer are handled through SSH | 
|  | and HTTPS. Uploads are protected by the authentication, | 
|  | confidentiality and integrity offered by the transport (SSH, HTTPS). | 
|  |  | 
|  | Each Git commit created on the client desktop system is converted into | 
|  | a unique change record which can be reviewed independently. | 
|  |  | 
|  | A summary of each newly uploaded change is automatically emailed | 
|  | to reviewers, so they receive a direct hyperlink to review the | 
|  | change on the web.  Reviewer email addresses can be specified on the | 
|  | `git push` command line, but typically reviewers are added in the web | 
|  | interface. | 
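
As a rough illustration of the upload step, the helper below builds the kind of refspec that is typically passed to `git push` when uploading for review. The `refs/for/<branch>` target and the `%r=<email>` reviewer option are not described in this document; they are assumptions taken from Gerrit's upload documentation, and the function name and email addresses are hypothetical.

```python
def review_refspec(branch, reviewers=()):
    """Build a refspec that uploads HEAD for review on `branch`.

    Assumes Gerrit's `refs/for/<branch>` magic ref and the `%r=<email>`
    push option; this helper is illustrative and not part of Gerrit.
    """
    refspec = f"HEAD:refs/for/{branch}"
    if reviewers:
        # Push options follow a '%' and are separated by commas.
        refspec += "%" + ",".join(f"r={email}" for email in reviewers)
    return refspec
```

The returned string would be used as `git push origin <refspec>`, for example with `review_refspec("master", ["alice@example.com"])`.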
|  |  | 
|  | Reviewers use the web interface to read the side-by-side or unified | 
|  | diff of a change, and insert draft inline/file comments where | 
|  | appropriate. Draft comments are visible only to their author until | 
|  | they are published.  Published comments are automatically | 
|  | emailed to the change author by Gerrit, and are CC'd to all other | 
|  | reviewers who have already commented on the change. | 
|  |  | 
|  | Reviewers can score the change ("vote"), indicating whether they feel the | 
|  | change is ready for inclusion in the project, needs more work, or | 
|  | should be rejected outright. These scores provide direct feedback to | 
|  | Gerrit's change submit function. | 
|  |  | 
|  | After a change has been scored positively by reviewers, Gerrit enables | 
|  | a submit button on the web interface. Authorized users can push the | 
|  | submit button to have the change enter the project repository. The | 
|  | user pressing the submit button does not need to be the author of the | 
|  | change. | 
|  |  | 
|  |  | 
|  | == Infrastructure | 
|  |  | 
|  | End-user web browsers make HTTP requests directly to Gerrit's | 
|  | HTTP server. As nearly all of the Gerrit user interface is implemented | 
|  | in a JavaScript based web app, the majority of these requests are | 
|  | transmitting compressed JSON payloads, with all HTML being generated | 
|  | within the browser. | 
|  |  | 
|  | Gerrit's HTTP server side component is implemented as a standard Java | 
|  | servlet, and thus runs within any link:install-j2ee.html[J2EE servlet | 
|  | container]. The standard install will run inside Jetty, which is | 
|  | included in the binary. | 
|  |  | 
|  | End-user uploads are performed over SSH or HTTP, so Gerrit's servlets | 
|  | also start up a background thread to receive SSH connections through | 
|  | an independent SSH port. SSH clients communicate directly with this | 
|  | port, bypassing the HTTP server used by browsers. | 
|  |  | 
|  | User authentication is handled by identity realms. Gerrit supports the | 
|  | following types of authentication: | 
|  |  | 
|  | * OpenID (see link:http://openid.net/developers/specs/[OpenID Specifications,role=external,window=_blank]) | 
|  | * OAuth2 | 
|  | * LDAP | 
|  | * Google accounts (on googlesource.com) | 
|  | * SAML | 
|  | * Kerberos | 
|  | * 3rd party SSO | 
|  |  | 
|  | === NoteDb | 
|  |  | 
|  | Server side data storage for Gerrit is broken down into two different | 
|  | categories: | 
|  |  | 
|  | * Git repository data | 
|  | * Gerrit metadata | 
|  |  | 
|  | The Git repository data is the Git object database used to store | 
|  | already submitted revisions, as well as all uploaded (proposed) | 
|  | changes.  Gerrit uses the standard Git repository format, and | 
|  | therefore requires direct filesystem access to the repositories. | 
|  | All repository data is stored in the filesystem and accessed through | 
|  | the JGit library.  Repository data can be stored on remote servers | 
|  | accessible through NFS or SMB, but the remote directory must | 
|  | be mounted on the Gerrit server as part of the local filesystem | 
|  | namespace.  Remote filesystems are likely to perform worse than | 
|  | local ones, due to Git disk IO behavior not being optimized for | 
|  | remote access. | 
|  |  | 
|  | The Gerrit metadata contains a summary of the available changes, all | 
|  | comments (published and drafts), and individual user account | 
|  | information. | 
|  |  | 
|  | Gerrit metadata is also stored in Git, with the commits marking the | 
|  | historical state of metadata. Data is stored in the trees associated | 
|  | with the commits, typically using Git config file or JSON as the base | 
|  | format. For metadata, there are 3 types of data: changes, accounts and | 
|  | groups. | 
|  |  | 
|  | Accounts are stored in a special Git repository `All-Users`. | 
|  |  | 
|  | Accounts can be grouped in groups. Gerrit has a built-in group system, | 
|  | but can also interface to external group systems (e.g. Google groups, | 
|  | LDAP). The built-in groups are stored in `All-Users`. | 
|  |  | 
|  | Draft comments are stored in `All-Users` too. | 
|  |  | 
|  | Permissions are stored in Git, in a branch `refs/meta/config` for the | 
|  | repository. Repository configuration (including permissions) supports | 
|  | single inheritance, with the `All-Projects` repository containing | 
|  | site-wide defaults. | 
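
Single inheritance can be pictured as a walk up the parent chain until a value is found. The sketch below is a simplified model, assuming plain key/value settings held in in-memory dictionaries; the project names and the setting key are hypothetical, and the way access rules from parent and child are combined in practice is more involved than a first-match lookup.

```python
def effective_value(project, key, configs, parents):
    """Resolve `key` for `project` by walking the single-inheritance chain.

    `configs` maps project name -> locally set values and `parents` maps
    project name -> parent name. Both are hypothetical stand-ins for the
    per-repository configuration stored in `refs/meta/config`.
    """
    seen = set()
    current = project
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        local = configs.get(current, {})
        if key in local:
            return local[key]
        # The root (All-Projects here) has no parent, which ends the walk.
        current = parents.get(current)
    return None


PARENTS = {"tools/build": "tools", "tools": "All-Projects"}
CONFIGS = {
    "All-Projects": {"example.setting": "site default"},
    "tools": {"example.setting": "tools override"},
}
```

With this data, `tools/build` sets nothing itself and inherits the value from `tools`, while a project parented directly to `All-Projects` falls back to the site-wide default.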
|  |  | 
|  | Code review metadata (change status, votes, and comments) is stored | 
|  | in NoteDb, in the same Git repository as the submitted code and the | 
|  | code under review. Hence, the review history can be exported with | 
|  | `git clone --mirror` by anyone with sufficient permissions. | 
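
To show how review data can sit next to code in one repository, the sketch below computes the ref names under which a change's patch sets and metadata would be found. The `refs/changes/<shard>/<change>/...` layout, with the shard being the last two digits of the change number, is not spelled out in this document and is stated here as an assumption; the function names are hypothetical.

```python
def change_ref_prefix(change_number):
    """Return the ref namespace for a change, e.g. refs/changes/34/1234.

    Assumes the shard is the last two digits of the change number,
    zero-padded to width two.
    """
    if change_number <= 0:
        raise ValueError("change numbers are positive integers")
    shard = f"{change_number % 100:02d}"
    return f"refs/changes/{shard}/{change_number}"


def patch_set_ref(change_number, patch_set):
    """Ref pointing at the commit uploaded as the given patch set."""
    return f"{change_ref_prefix(change_number)}/{patch_set}"


def meta_ref(change_number):
    """Ref pointing at the change's review metadata history."""
    return f"{change_ref_prefix(change_number)}/meta"
```

Because these are ordinary refs, a mirror clone copies them together with the branches.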
|  |  | 
|  | == Permissions | 
|  |  | 
|  | Permissions are specified on branch names, and given to groups. For | 
|  | example, | 
|  |  | 
|  | ``` | 
|  | [access "refs/heads/stable/*"] | 
|  | push = group Release-Engineers | 
|  | ``` | 
|  |  | 
|  | This rule grants the Release-Engineers group push permission for | 
|  | stable branches. | 
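
The following sketch shows one way such a rule could be evaluated. It is a deliberately reduced model, assuming only exact refs and `prefix/*` patterns and only `group` grants; it ignores inheritance, deny or block rules, and regular-expression patterns, and all function names are hypothetical.

```python
def parse_access(text):
    """Parse `[access "<ref pattern>"]` sections into {pattern: {perm: [groups]}}."""
    rules = {}
    pattern = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith('[access "') and line.endswith('"]'):
            pattern = line[len('[access "'):-len('"]')]
            rules.setdefault(pattern, {})
        elif line.startswith("["):
            pattern = None  # some other section; ignored in this sketch
        elif pattern is not None and "=" in line:
            permission, value = (part.strip() for part in line.split("=", 1))
            if value.startswith("group "):
                group = value[len("group "):]
                rules[pattern].setdefault(permission, []).append(group)
    return rules


def ref_matches(pattern, ref):
    """Match an exact ref or a `prefix/*` pattern."""
    if pattern.endswith("/*"):
        return ref.startswith(pattern[:-1])
    return ref == pattern


def is_allowed(rules, permission, ref, user_groups):
    """True if any of the user's groups is granted `permission` on `ref`."""
    for pattern, perms in rules.items():
        if ref_matches(pattern, ref):
            if set(perms.get(permission, [])) & set(user_groups):
                return True
    return False
```

Parsing the example rule and asking about `refs/heads/stable/3.9` for a member of `Release-Engineers` yields a grant, while the same user is not granted push on `refs/heads/master` by this rule.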
|  |  | 
|  | There are fundamentally two types of permissions: | 
|  |  | 
|  | * Write permissions (who can vote, push, submit etc.) | 
|  |  | 
|  | * Read permissions (who can see data) | 
|  |  | 
|  | Read permissions need special treatment across Gerrit, because Gerrit | 
|  | should only surface data (including repository existence) if a user | 
|  | has read permission. This means that | 
|  |  | 
|  | * Git wire protocol support must omit references from the | 
|  | advertisement if the user lacks read permissions | 
|  |  | 
|  | * Uploads through the Git wire protocol must refuse commits that are | 
|  | based on SHA-1s for data that the user can't see. | 
|  |  | 
|  | * Tags are only visible if their commits are visible to the user | 
|  | through a non-tag reference. | 
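
The advertisement and tag rules above can be sketched on a toy commit graph. This is a minimal in-memory model, not how the server walks real Git objects: `refs`, `parents`, and `can_read` are hypothetical inputs, and details such as annotated tag objects are left out.

```python
def reachable(tips, parents):
    """Return every commit reachable from `tips` in a toy commit graph."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def advertised_refs(refs, parents, can_read):
    """Filter `refs` (name -> commit) down to what a user may see."""
    # Non-tag refs are advertised only if the user can read them.
    visible = {name: commit for name, commit in refs.items()
               if not name.startswith("refs/tags/") and can_read(name)}
    visible_commits = reachable(visible.values(), parents)
    # A tag is advertised only if its commit is reachable from a visible
    # non-tag ref.
    for name, commit in refs.items():
        if name.startswith("refs/tags/") and commit in visible_commits:
            visible[name] = commit
    return visible
```

A tag that points into a branch the user cannot read is dropped from the result, even though the tag ref itself carries no explicit restriction in this model.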
|  |  | 
|  | Metadata (e.g. OAuth credentials) is also stored in Git. Existing | 
|  | endpoints must refuse creating branches or changes that expose these | 
|  | metadata or allow changes to them. | 
|  |  | 
|  |  | 
|  | === Indexing | 
|  |  | 
|  | Almost all data is stored in Git, but Git only supports fast lookup by | 
|  | SHA-1 or by ref (branch) name. Therefore Gerrit also has an indexing | 
|  | system (powered by Lucene by default) for other types of queries. | 
|  | There are 4 indices: | 
|  |  | 
|  | * Project index - find repositories by name, parent project, etc. | 
|  | * Account index - find accounts by name, email, etc. | 
|  | * Group index - find groups by name, owner, description etc. | 
|  | * Change index - find changes by file, status, modification date etc. | 
|  |  | 
|  | The base entities are characterized by SHA-1s. Storing the | 
|  | characterizing SHA-1s allows detection of stale index entries. | 
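
Staleness detection along these lines can be sketched as a comparison between the SHA-1 recorded at index time and the SHA-1 currently in Git. The entry fields and the lookup dictionary below are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    change_id: int
    status: str
    meta_sha1: str  # SHA-1 of the metadata commit the entry was built from


def is_stale(entry, current_sha1_by_change):
    """An entry is stale if Git no longer points at the recorded SHA-1."""
    return current_sha1_by_change.get(entry.change_id) != entry.meta_sha1


def stale_entries(index, current_sha1_by_change):
    """Return the entries that would need to be reindexed."""
    return [entry for entry in index if is_stale(entry, current_sha1_by_change)]
```

An entry whose change has since received a new metadata commit, or has disappeared from Git, is reported for reindexing; entries whose SHA-1 still matches are left alone.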
|  |  | 
|  | == Plug-in architecture | 
|  |  | 
|  | Gerrit has a plug-in architecture. Plugins can be installed by | 
|  | dropping them into $site_directory/plugins, or at runtime through | 
|  | plugin SSH commands, or the plugin REST API. | 
|  |  | 
|  | === Backend plugins | 
|  |  | 
|  | At runtime, code can be loaded from a `.jar` file. This code can hook | 
|  | into predefined extension points. A common use of plugins is to have | 
|  | Gerrit interoperate with site-specific tools, such as CI systems or | 
|  | issue trackers. | 
|  |  | 
|  | // list some notable extension points, and notable plugins | 
|  | // link to plugin development | 
|  |  | 
|  | Some backend plugins expose the JVM for scripting use (e.g. Groovy, | 
|  | Scala), so plugins can be written without having to set up a Java | 
|  | development environment. | 
|  |  | 
|  | // Luca to expand: how do script plugins load their scripts? | 
|  |  | 
|  | === Frontend plugins | 
|  |  | 
|  | The UI can be extended using frontend plugins. This is useful for | 
|  | changing the look & feel of Gerrit, but it can also be used to surface | 
|  | data from systems that aren't integrated with the Gerrit backend, e.g. | 
|  | CI systems or code coverage providers. | 
|  |  | 
|  | // FE team to write a bit more: | 
|  | // * how to load ? | 
|  | // * XSRF, CORS ? | 
|  |  | 
|  | == Internationalization and Localization | 
|  |  | 
|  | As a source code review system for open source projects, where the | 
|  | commonly preferred language for communication is typically English, | 
|  | Gerrit does not make internationalization or localization a priority. | 
|  |  | 
|  | The majority of Gerrit's users will be writing change descriptions | 
|  | and comments in English, and therefore an English user interface | 
|  | is usable by the target user base. | 
|  |  | 
|  |  | 
|  | == Accessibility Considerations | 
|  |  | 
|  | // UI team to rewrite this. | 
|  |  | 
|  | Whenever possible Gerrit displays raw text rather than image icons, | 
|  | so screen readers should still be able to provide useful information | 
|  | to blind persons accessing Gerrit sites. | 
|  |  | 
|  | Standard HTML hyperlinks are used rather than HTML div or span tags | 
|  | with click listeners.  This provides two benefits to the end-user. | 
|  | The first benefit is that screen readers are optimized for locating | 
|  | standard hyperlink anchors and presenting them to the end-user as | 
|  | a navigation action.  The second benefit is that users can use | 
|  | the 'open in new tab/window' feature of their browser whenever | 
|  | they choose. | 
|  |  | 
|  | When possible, Gerrit uses the ARIA properties on DOM widgets to | 
|  | provide hints to screen readers. | 
|  |  | 
|  |  | 
|  | == Browser Compatibility | 
|  |  | 
|  | Gerrit requires a JavaScript enabled browser. | 
|  |  | 
|  | // UI team to add section on minimum browser requirements. | 
|  |  | 
|  | As Gerrit is a pure JavaScript application on the client side, with | 
|  | no server side rendering fallbacks, the browser must support modern | 
|  | JavaScript semantics in order to access the Gerrit web application. | 
|  | Dumb clients such as `lynx`, `wget`, `curl`, or even many search engine | 
|  | spiders are not able to access Gerrit content. | 
|  |  | 
|  | All of the content stored within Gerrit is also available through | 
|  | other means, such as gitweb or the `git://` protocol. Any existing | 
|  | search engine crawlers can index the server-side HTML served by a code | 
|  | browser, and thus can index the majority of the changes which might | 
|  | appear in Gerrit. Therefore the lack of support for most search engine | 
|  | crawlers is a non-issue for most Gerrit deployments. | 
|  |  | 
|  |  | 
|  | == Product Integration | 
|  |  | 
|  | Gerrit optionally surfaces links to HTML pages in a code browser. The | 
|  | links are configurable, and Gerrit comes with a built-in code browser, | 
|  | called Gitiles. | 
|  |  | 
|  | Gerrit integrates with some types of corporate single-sign-on (SSO) | 
|  | solutions, typically by having the SSO authentication be performed | 
|  | in a reverse proxy web server and then blindly trusting that all | 
|  | incoming connections have been authenticated by that reverse proxy. | 
|  | When configured to use this form of authentication, Gerrit does | 
|  | not integrate with OpenID providers. | 
|  |  | 
|  | When installing Gerrit, administrators may optionally include an | 
|  | HTML header or footer snippet which may include user tracking code, | 
|  | such as that used by Google Analytics.  This is a per-instance | 
|  | configuration that must be done by hand, and is not supported | 
|  | out of the box.  Other site trackers instead of Google Analytics | 
|  | can be used, as the administrator can supply any HTML/JavaScript | 
|  | they choose. | 
|  |  | 
|  | Gerrit does not integrate with any Google service, or with any | 
|  | services other than those listed above. | 
|  |  | 
|  | Plugins (see above) can be used to drive product integrations from the | 
|  | Gerrit side. Products that support Gerrit explicitly can use the REST | 
|  | API or the SSH API to contact Gerrit. | 
|  |  | 
|  |  | 
|  | == Privacy Considerations | 
|  |  | 
|  | Gerrit stores the following information per user account: | 
|  |  | 
|  | * Full Name | 
|  | * Preferred Email Address | 
|  |  | 
|  | The full name and preferred email address fields are shown to any | 
|  | site visitor viewing a page containing a change uploaded by the | 
|  | account owner, or containing a published comment written by the | 
|  | account owner. | 
|  |  | 
|  | Showing the full name and preferred email is approximately the same | 
|  | risk as the `From` header of an email posted to a public mailing | 
|  | list that maintains archives, and Gerrit treats these fields in | 
|  | much the same way that a mailing list archive might handle them. | 
|  | Users who don't want to expose this information should either not | 
|  | participate in a Gerrit based online community, or open a new email | 
|  | address dedicated for this use. | 
|  |  | 
|  | As the Gerrit UI data is only available through XSRF protected | 
|  | JSON-RPC calls, "screen-scraping" for email addresses is difficult, | 
|  | but not impossible.  It is unlikely a spammer will go through the | 
|  | effort required to code a custom scraping application necessary | 
|  | to cull email addresses from published Gerrit comments.  In most | 
|  | cases these same addresses would be more easily obtained from the | 
|  | project's mailing list archives. | 
|  |  | 
|  | The user's name and email address are stored unencrypted in the | 
|  | link:config-accounts.html#all-users[All-Users] repository. | 
|  |  | 
|  | == Spam and Abuse Considerations | 
|  |  | 
|  | There is no spam protection for the Git protocol upload path. | 
|  | Uploading a change successfully requires a pre-existing account, and a | 
|  | lot of up-front effort. | 
|  |  | 
|  | Gerrit makes no attempt to detect spam changes or comments in the web | 
|  | UI. To post and publish a comment a client must sign in and then use | 
|  | the XSRF protected JSON-RPC interface to publish the draft on an | 
|  | existing change record. | 
|  |  | 
|  | Absence of spam handling is based upon the idea that Gerrit caters to | 
|  | a niche audience, and will therefore be unattractive to spammers. In | 
|  | addition, it is not a factor for corporate, on-premise deployments. | 
|  |  | 
|  |  | 
|  | == Scalability | 
|  |  | 
|  | Gerrit supports the Git wire protocol, and an API (one API for HTTP, | 
|  | and one for SSH). | 
|  |  | 
|  | The Git wire protocol does a client/server negotiation to avoid | 
|  | sending too much data. This negotiation occupies a CPU, so the number | 
|  | of concurrent push/fetch operations should be capped by the number of | 
|  | CPUs. | 
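
One way to picture this cap is a semaphore sized to the CPU count. The class below is a sketch under that assumption, not a description of Gerrit's actual thread pools or configuration options; the class name is hypothetical.

```python
import os
import threading


class GitOperationLimiter:
    """Allow at most `limit` concurrent push/fetch negotiations."""

    def __init__(self, limit=None):
        # Default to one slot per CPU, falling back to 1 if unknown.
        self.limit = limit or os.cpu_count() or 1
        self._slots = threading.BoundedSemaphore(self.limit)

    def run(self, operation):
        # Blocks until a slot is free, then runs the CPU-heavy operation.
        with self._slots:
            return operation()
```

Requests beyond the limit wait for a free slot instead of competing for CPU time with negotiations that are already running.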
|  |  | 
|  | Clients on slow network connections may be network bound rather than | 
|  | server side CPU bound, in which case a core may be effectively shared | 
|  | with another user. Possible core sharing due to network bottlenecks | 
|  | generally holds true for network connections running below 10 MiB/sec. | 
|  |  | 
|  | Deployments for large, distributed companies can replicate Git data to | 
|  | read-only replicas to offload fetch traffic. The read-only replicas | 
|  | should also serve this data using Gerrit to ensure that permissions | 
|  | are obeyed. | 
|  |  | 
|  | The API serves requests of varying costs. Requests that originate in | 
|  | the UI can block productivity, so care has been taken to optimize | 
|  | these for latency, using the following techniques: | 
|  |  | 
|  | * Async calls: the UI becomes responsive before some UI elements | 
|  | have finished loading | 
|  |  | 
|  | * Caching: metadata is stored in Git, which is relatively expensive to | 
|  | access. This is sped up by multiple caches. Metadata entities are | 
|  | stored in Git, and can therefore be seen as immutable values keyed | 
|  | by SHA-1, which is very amenable to caching. All SHA-1 keyed caches | 
|  | can be persisted on local disk. | 
|  |  | 
|  | The size (memory, disk) of these caches should be adapted to the | 
|  | instance size (number of users, size and quantity of repositories) | 
|  | for optimal performance. | 
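
The reason SHA-1 keyed values cache well is that a key never changes meaning: updated metadata gets a new SHA-1 rather than overwriting the old value, so entries need no invalidation. The sketch below illustrates this with a hypothetical in-memory cache; `content_sha1` hashes raw bytes as a stand-in and is not Git's object hashing.

```python
import hashlib


def content_sha1(content):
    """Stand-in for a Git object id: the SHA-1 of the raw bytes."""
    return hashlib.sha1(content).hexdigest()


class Sha1Cache:
    """Cache of immutable values keyed by SHA-1; entries never go stale."""

    def __init__(self, loader):
        self._loader = loader  # expensive lookup, e.g. reading from Git
        self._values = {}
        self.misses = 0

    def get(self, sha1):
        if sha1 not in self._values:
            self.misses += 1
            self._values[sha1] = self._loader(sha1)
        return self._values[sha1]
```

Persisting such a cache on disk is safe for the same reason: a value loaded for a given SHA-1 remains valid across restarts.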
|  |  | 
|  | Git does not impose fundamental limits (e.g. number of files per | 
|  | change) on data. To ensure stability, Gerrit configures a number of | 
|  | default limits for these. | 
|  |  | 
|  | // add a link to the default settings. | 
|  |  | 
|  | === Scaling team size | 
|  |  | 
|  | A team of size N has N^2 possible interactions. As a result, features | 
|  | that expose interactions with activities of other team members have a | 
|  | quadratic cost in aggregate. The following features scale poorly with | 
|  | large team sizes: | 
|  |  | 
|  | * the change screen shows conflicting changes by default. This data is | 
|  | cached, but updates to pending changes cause cache misses. For a | 
|  | single change, the amount of work is proportional to the number of | 
|  | pending changes, so in aggregate, the cost of this feature is | 
|  | quadratic in the team size. | 
|  |  | 
|  | * the change screen shows if a change is mergeable to the target | 
|  | branch. If the target branch moves quickly (large developer team), | 
|  | this causes cache misses. In aggregate, the cost of this feature is | 
|  | also quadratic. | 
|  |  | 
|  | Both features should be turned off for repositories that involve 1000s | 
|  | of developers. | 
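
A back-of-the-envelope calculation makes the quadratic growth concrete. The function below assumes, purely for illustration, that every developer has the same number of pending changes and that each pending change is compared once against every other pending change.

```python
def aggregate_conflict_checks(team_size, pending_per_developer=1):
    """Pairwise comparisons needed to refresh conflict data for all changes."""
    pending = team_size * pending_per_developer
    # Each pending change is compared against every other pending change.
    return pending * (pending - 1)


small_team = aggregate_conflict_checks(10)    # 10 * 9 = 90
large_team = aggregate_conflict_checks(1000)  # 1000 * 999 = 999000
```

Growing the team by a factor of 100 increases the aggregate work by roughly a factor of 10,000 under these assumptions.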
|  |  | 
|  | === Browser performance | 
|  |  | 
|  | // say something about browser performance tuning. | 
|  |  | 
|  | === Real life numbers | 
|  |  | 
|  |  | 
|  | Gerrit is designed for very large projects, both open source and | 
|  | proprietary commercial projects. For a single Gerrit process, the | 
|  | following limits are known to work: | 
|  |  | 
|  | .Observed maximums | 
|  | [options="header"] | 
|  | |====================================================== | 
|  | |Parameter        |         Maximum | Deployment | 
|  | |Projects         |         50,000  | gerrithub.io | 
|  | |Contributors     |        150,000  | eclipse.org | 
|  | |Bytes/repo       |        100G     | Qualcomm internal | 
|  | |Changes/repo     |        300k     | Qualcomm internal | 
|  | |Revisions/Change |        300      | Qualcomm internal | 
|  | |Reviewers/Change |        87       | Qualcomm internal | 
|  | |====================================================== | 
|  |  | 
|  |  | 
|  | // find some numbers for these stats: | 
|  | // |Files/repo       |        ? | | 
|  | // |Files/Change     |        ? | | 
|  | // |Comments/Change  |        ? | | 
|  | // |max QPS/CPU      |        ? | | 
|  |  | 
|  |  | 
|  | Google runs a horizontally scaled deployment. We have seen the | 
|  | following per-JVM maximums: | 
|  |  | 
|  | .Observed maximums (googlesource.com) | 
|  | [options="header"] | 
|  | |====================================================== | 
|  | |Parameter        |         Maximum | Deployment | 
|  | |Files/repo       |        500,000  | chromium-review | 
|  | |Bytes/repo       |         12G     | chromium-review | 
|  | |Changes/repo     |          500k   | chromium-review | 
|  | |Revisions/Change |          1900   | chromium-review | 
|  | |Files/Change     |           10,000| android-review | 
|  | |Comments/Change  |           1,200 | chromium-review | 
|  | |====================================================== | 
|  |  | 
|  |  | 
|  | == Redundancy & Reliability | 
|  |  | 
|  | Gerrit is structured as a single JVM process, reading and writing to a | 
|  | single file system. If there are hardware failures in the machine | 
|  | running the JVM, or the storage holding the repositories, there is no | 
|  | recourse; on failure, errors will be returned to the client. | 
|  |  | 
|  | Deployments needing more stringent uptime guarantees can use a | 
|  | replication/multi-master setup, which ensures availability and | 
|  | geographical distribution, at the cost of slower write actions. | 
|  |  | 
|  | // TODO: link. | 
|  |  | 
|  | === Backups | 
|  |  | 
|  | Using the standard replication plugin, Gerrit can be configured | 
|  | to replicate changes made to the local Git repositories over any | 
|  | standard Git transports. After the plugin is installed, remote | 
|  | destinations can be configured in `'$site_path'/etc/replication.conf` | 
|  | to send copies of all changes over SSH to other servers, or to the | 
|  | Amazon S3 blob storage service. | 
|  |  | 
|  |  | 
|  | == Logging Plan | 
|  |  | 
|  | Gerrit stores Apache style HTTPD logs, as well as ERROR/INFO messages | 
|  | from the Java logger, under `$site_dir/logs/`. | 
|  |  | 
|  | Published comments contain a publication date, so users can judge | 
|  | when the comment was posted and decide if it was "recent" or not. | 
|  | Only the timestamp is stored in the database; the IP address of | 
|  | the comment author is not stored. | 
|  |  | 
|  | Changes uploaded over the SSH daemon from `git push` have the | 
|  | standard Git reflog updated with the date and time that the upload | 
|  | occurred, and the Gerrit account identity of who did the upload. | 
|  | Changes submitted and merged into a branch also update the | 
|  | Git reflog.  These logs are available only to the Gerrit site | 
|  | administrator, and they are not replicated through the automatic | 
|  | replication noted earlier.  These logs are primarily recorded for an | 
|  | "oh s**t" moment where the administrator has to rewind data.  In most | 
|  | installations they are a waste of disk space.  Future versions of | 
|  | JGit may allow disabling these logs, and Gerrit may take advantage | 
|  | of that feature to stop writing these logs. | 
|  |  | 
|  | A web server positioned in front of Gerrit (such as a reverse proxy) | 
|  | or the hosting servlet container may record access logs, and these | 
|  | logs may be mined for usage information.  This is outside of the | 
|  | scope of Gerrit. | 
|  |  | 
|  |  | 
|  | GERRIT | 
|  | ------ | 
|  | Part of link:index.html[Gerrit Code Review] | 
|  |  | 
|  | SEARCHBOX | 
|  | --------- |