
 :linkattrs:
 = Gerrit Code Review - System Design

 == Objective

 Gerrit is a web-based code review system, facilitating online code
 reviews for projects using the Git version control system.

 Gerrit makes reviews easier by showing changes in a side-by-side
 display, and allowing inline/file comments to be added by any reviewer.

 Gerrit simplifies Git-based project maintainership by permitting
 any authorized user to submit changes to the master Git repository,
 rather than requiring all approved changes to be merged in by
 hand by the project maintainer.  This functionality enables a more
 centralized usage of Git.


 == Background

 Git is a distributed version control system, wherein each repository
 is assumed to be owned/maintained by a single user.  There are no
 inherent security controls built into Git, so the ability to read
 from or write to a repository is controlled entirely by the host's
 filesystem or network access controls.

 The objective of Gerrit is to facilitate Git development by larger
 teams: it provides a means to enforce organizational policies around
 code submissions, eg. "all code must be reviewed by another
 developer", "all code shall pass tests". It achieves this by

 * providing fine-grained (per-branch, per-repository, inheriting)
   access controls, which allow a Gerrit admin to delegate permissions
   to different team(-lead)s.

 * facilitating code review: Gerrit offers a web view of pending code
   changes that allows for easy reading and commenting by humans. The
   web view can also show data coming out of automated QA processes
   (eg. CI). The permission system also includes fine-grained control
   of who can approve pending changes for submission, to further
   facilitate delegation of code ownership.

 == Overview

 Developers create one or more changes on their local desktop system,
 then upload them for review to Gerrit using the standard `git push`
 command line program, or any GUI which can invoke `git push` on behalf
 of the user. Authentication and data transfer are handled through SSH
 and HTTPS. Uploads are protected by the authentication,
 confidentiality and integrity offered by the transport (SSH, HTTPS).

 Each Git commit created on the client desktop system is converted into
 a unique change record which can be reviewed independently.

 A summary of each newly uploaded change is automatically emailed
 to reviewers, so they receive a direct hyperlink to review the
 change on the web.  Reviewer email addresses can be specified on the
 `git push` command line, but typically reviewers are added in the web
 interface.
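
 As an illustration of the upload step, the sketch below builds the
 `git push` argument list for sending the current commit to review. It
 assumes Gerrit's `refs/for/<branch>` upload namespace and the `%r=`
 push option for naming reviewers; the helper name `review_push_args`,
 the remote name `origin` and the sample address are hypothetical.

```python
from typing import List, Sequence


def review_push_args(branch: str,
                     reviewers: Sequence[str] = (),
                     remote: str = "origin") -> List[str]:
    """Build a `git push` command that uploads HEAD for review on `branch`."""
    # Assumption: changes are uploaded by pushing to refs/for/<branch>.
    refspec = f"HEAD:refs/for/{branch}"
    if reviewers:
        # Assumption: reviewers are given as comma-separated r=<email> options.
        refspec += "%" + ",".join(f"r={email}" for email in reviewers)
    return ["git", "push", remote, refspec]


# Hypothetical reviewer address, for illustration only.
print(" ".join(review_push_args("master", ["alice@example.com"])))
```

 The resulting list can be handed to a process launcher or printed for
 the user to run by hand.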

 Reviewers use the web interface to read the side-by-side or unified
 diff of a change, and insert draft inline/file comments where
 appropriate. A draft comment is visible only to the reviewer who wrote
 it, until that reviewer publishes it.  Published comments are automatically
 emailed to the change author by Gerrit, and are CC'd to all other
 reviewers who have already commented on the change.

 Reviewers can score the change ("vote"), indicating whether they feel the
 change is ready for inclusion in the project, needs more work, or
 should be rejected outright. These scores provide direct feedback to
 Gerrit's change submit function.

 After a change has been scored positively by reviewers, Gerrit enables
 a submit button on the web interface. Authorized users can push the
 submit button to have the change enter the project repository. The
 user pressing the submit button does not need to be the author of the
 change.
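
 How scores feed the submit function can be sketched as follows. This
 is a simplified illustration of one possible policy (at least one top
 score and no veto), not Gerrit's actual implementation; the score
 range of -2 to +2 and the function name `can_submit` are assumptions.

```python
from typing import Mapping


def can_submit(votes: Mapping[str, int],
               max_score: int = 2,
               min_score: int = -2) -> bool:
    """Return True when some reviewer gave the top score and nobody vetoed."""
    scores = list(votes.values())
    has_approval = any(score == max_score for score in scores)
    has_veto = any(score == min_score for score in scores)
    return has_approval and not has_veto


# Hypothetical reviewers: one approval, one mild endorsement, no veto.
print(can_submit({"alice": 2, "bob": 1}))
```

 Under this policy a single lowest score keeps the submit button
 disabled regardless of how many reviewers approved.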


 == Infrastructure

 End-user web browsers make HTTP requests directly to Gerrit's
 HTTP server. As nearly all of the Gerrit user interface is implemented
 in a JavaScript based web app, the majority of these requests are
 transmitting compressed JSON payloads, with all HTML being generated
 within the browser.

 Gerrit's HTTP server side component is implemented as a standard Java
 servlet, and thus runs within any link:install-j2ee.html[J2EE servlet
 container]. The standard install will run inside Jetty, which is
 included in the binary.

 End-user uploads are performed over SSH or HTTP, so Gerrit's servlets
 also start up a background thread to receive SSH connections through
 an independent SSH port. SSH clients communicate directly with this
 port, bypassing the HTTP server used by browsers.

 User authentication is handled by identity realms. Gerrit supports the
 following types of authentication:

 * OpenID (see link:http://openid.net/developers/specs/[OpenID Specifications,role=external,window=_blank])
 * OAuth2
 * LDAP
 * Google accounts (on googlesource.com)
 * SAML
 * Kerberos
 * 3rd party SSO

 === NoteDb

 Server side data storage for Gerrit is broken down into two different
 categories:

 * Git repository data
 * Gerrit metadata

 The Git repository data is the Git object database used to store
 already submitted revisions, as well as all uploaded (proposed)
 changes.  Gerrit uses the standard Git repository format, and
 therefore requires direct filesystem access to the repositories.
 All repository data is stored in the filesystem and accessed through
 the JGit library.  Repository data can be stored on remote servers
 accessible through NFS or SMB, but the remote directory must
 be mounted on the Gerrit server as part of the local filesystem
 namespace.  Remote filesystems are likely to perform worse than
 local ones, due to Git disk IO behavior not being optimized for
 remote access.

 The Gerrit metadata contains a summary of the available changes, all
 comments (published and drafts), and individual user account
 information.

 Gerrit metadata is also stored in Git, with the commits marking the
 historical state of metadata. Data is stored in the trees associated
 with the commits, typically using Git config file or JSON as the base
 format. For metadata, there are 3 types of data: changes, accounts and
 groups.

 Accounts are stored in a special Git repository `All-Users`.

 Accounts can be organized into groups. Gerrit has a built-in group
 system, but can also interface with external group systems (eg. Google
 groups, LDAP). The built-in groups are stored in `All-Users`.

 Draft comments are stored in `All-Users` too.

 Permissions are stored in Git, in a branch `refs/meta/config` for the
 repository. Repository configuration (including permissions) supports
 single inheritance, with the `All-Projects` repository containing
 site-wide defaults.

 Code review metadata (change status, votes, comments) is stored in
 NoteDb, that is, in Git, alongside the submitted code and the code
 under review. Hence, the review history can be exported with `git
 clone --mirror` by anyone with sufficient permissions.

 == Permissions

 Permissions are specified on branch names, and given to groups. For
 example,

 ```
 [access "refs/heads/stable/*"]
         push = group Release-Engineers
 ```

 This rule grants the `Release-Engineers` group push permission on
 stable branches.
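
 A minimal sketch of how such a rule could be evaluated is shown below.
 It is an illustration only: it assumes that a pattern ending in `/*`
 matches every ref under that prefix, and it ignores inheritance, deny
 rules and any other pattern forms. The `ACCESS` table and the function
 names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical in-memory form of the rule above:
# ref pattern -> permission name -> groups holding that permission.
ACCESS: Dict[str, Dict[str, Set[str]]] = {
    "refs/heads/stable/*": {"push": {"Release-Engineers"}},
}


def ref_matches(pattern: str, ref: str) -> bool:
    """Exact match, or prefix match when the pattern ends in '/*'."""
    if pattern.endswith("/*"):
        return ref.startswith(pattern[:-1])  # keep the trailing slash
    return ref == pattern


def is_allowed(user_groups: Set[str], permission: str, ref: str) -> bool:
    """True if a matching section grants `permission` to one of the groups."""
    for pattern, grants in ACCESS.items():
        if ref_matches(pattern, ref) and user_groups & grants.get(permission, set()):
            return True
    return False


print(is_allowed({"Release-Engineers"}, "push", "refs/heads/stable/3.9"))
```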

 There are fundamentally two types of permissions:

 * Write permissions (who can vote, push, submit etc.)

 * Read permissions (who can see data)

 Read permissions need special treatment across Gerrit, because Gerrit
 should only surface data (including repository existence) if a user
 has read permission. This means that

 * The git wire protocol support must omit references from
   advertisement if the user lacks read permissions

 * Uploads through the git wire protocol must refuse commits that are
   based on SHA-1s for data that the user can't see.

 * Tags are only visible if their commits are visible to the user
   through a non-tag reference.
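
 The first and third rules can be sketched as a filter over the ref
 advertisement. This is a toy model under stated assumptions: refs map
 names to commit IDs, history is a plain parent table, and `can_read`
 stands in for the real permission check. All names and sample data are
 hypothetical.

```python
from typing import Callable, Dict, List, Set


def ancestors(sha: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return `sha` plus every commit reachable from it."""
    seen: Set[str] = set()
    todo = [sha]
    while todo:
        cur = todo.pop()
        if cur not in seen:
            seen.add(cur)
            todo.extend(parents.get(cur, []))
    return seen


def advertised_refs(refs: Dict[str, str],
                    can_read: Callable[[str], bool],
                    parents: Dict[str, List[str]]) -> Dict[str, str]:
    """Filter the refs that may be advertised to a user."""
    # Step 1: keep only the non-tag refs the user may read.
    visible = {name: sha for name, sha in refs.items()
               if not name.startswith("refs/tags/") and can_read(name)}
    # Step 2: collect every commit reachable from those refs.
    reachable: Set[str] = set()
    for sha in visible.values():
        reachable |= ancestors(sha, parents)
    # Step 3: a tag is advertised only if its commit is reachable that way.
    for name, sha in refs.items():
        if name.startswith("refs/tags/") and sha in reachable:
            visible[name] = sha
    return visible


# Toy history: c2 and s1 both build on c1; s1 lives on a hidden branch.
PARENTS = {"c1": [], "c2": ["c1"], "s1": ["c1"]}
REFS = {
    "refs/heads/master": "c2",
    "refs/heads/secret": "s1",
    "refs/tags/v1": "c1",
    "refs/tags/secret-v1": "s1",
}
print(sorted(advertised_refs(REFS, lambda ref: ref != "refs/heads/secret", PARENTS)))
```

 In this toy data the tag on the hidden branch is withheld, while the
 tag on shared history is still advertised.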

 Metadata (eg. OAuth credentials) is also stored in Git. Existing
 endpoints must refuse to create branches or changes that would expose
 this metadata or allow it to be modified.


 === Indexing

 Almost all data is stored in Git, but Git only supports fast lookup by
 SHA-1 or by ref (branch) name. Therefore Gerrit also has an indexing
 system (powered by Lucene by default) for other types of queries.
 There are 4 indices:

 * Project index - find repositories by name, parent project, etc.
 * Account index - find accounts by name, email, etc.
 * Group index - find groups by name, owner, description, etc.
 * Change index - find changes by file, status, modification date, etc.

 The base entities are characterized by SHA-1s. Storing the
 characterizing SHA-1s allows detection of stale index entries.
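
 The staleness check can be illustrated with a small sketch. It assumes
 each index entry records the SHA-1 of the Git state it was built from,
 and that the current SHA-1 per entity can be looked up; the record
 layout and names are hypothetical.

```python
from typing import Dict, NamedTuple


class IndexEntry(NamedTuple):
    change_id: int
    status: str
    meta_sha1: str  # SHA-1 of the Git state this entry was indexed from


def is_stale(entry: IndexEntry, current_sha1: Dict[int, str]) -> bool:
    """An entry is stale when Git has moved on since it was indexed."""
    return current_sha1.get(entry.change_id) != entry.meta_sha1


# Hypothetical data: change 42 was updated in Git after it was indexed.
entry = IndexEntry(change_id=42, status="NEW", meta_sha1="a" * 40)
print(is_stale(entry, {42: "b" * 40}))
```

 A stale entry found this way can be reindexed from Git, which remains
 the source of truth.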

 == Plug-in architecture

 Gerrit has a plug-in architecture. Plugins can be installed by
 dropping them into $site_directory/plugins, or at runtime through
 plugin SSH commands, or the plugin REST API.

 === Backend plugins

 At runtime, code can be loaded from a `.jar` file. This code can hook
 into predefined extension points. A common use of plugins is to have
 Gerrit interoperate with site-specific tools, such as CI-systems or
 issue trackers.

 // list some notable extension points, and notable plugins
 // link to plugin development

 Some backend plugins expose the JVM for scripting use (eg. Groovy,
 Scala), so plugins can be written without having to set up a Java
 development environment.

 // Luca to expand: how do script plugins load their scripts?

 === Frontend plugins

 The UI can be extended using Frontend plugins. This is useful for
 changing the look & feel of Gerrit, but it can also be used to surface
 data from systems that aren't integrated with the Gerrit backend, eg.
 CI systems or code coverage providers.

 // FE team to write a bit more:
 // * how to load ?
 // * XSRF, CORS ?

 == Internationalization and Localization

 As a source code review system for open source projects, where the
 commonly preferred language for communication is typically English,
 Gerrit does not make internationalization or localization a priority.

 The majority of Gerrit's users will be writing change descriptions
 and comments in English, and therefore an English user interface
 is usable by the target user base.


 == Accessibility Considerations

 // UI team to rewrite this.

 Whenever possible Gerrit displays raw text rather than image icons,
 so screen readers should still be able to provide useful information
 to blind persons accessing Gerrit sites.

 Standard HTML hyperlinks are used rather than HTML div or span tags
 with click listeners.  This provides two benefits to the end-user.
 The first benefit is that screen readers are optimized for locating
 standard hyperlink anchors and presenting them to the end-user as
 a navigation action.  The second benefit is that users can use
 the 'open in new tab/window' feature of their browser whenever
 they choose.

 When possible, Gerrit uses the ARIA properties on DOM widgets to
 provide hints to screen readers.


 == Browser Compatibility

 Gerrit requires a JavaScript enabled browser.

 // UI team to add section on minimum browser requirements.

 As Gerrit is a pure JavaScript application on the client side, with
 no server side rendering fallbacks, the browser must support modern
 JavaScript semantics in order to access the Gerrit web application.
 Dumb clients such as `lynx`, `wget`, `curl`, or even many search engine
 spiders are not able to access Gerrit content.

 All of the content stored within Gerrit is also available through
 other means, such as gitweb or the `git://` protocol. Any existing
 search engine crawlers can index the server-side HTML served by a code
 browser, and thus can index the majority of the changes which might
 appear in Gerrit. Therefore the lack of support for most search engine
 crawlers is a non-issue for most Gerrit deployments.


 == Product Integration

 Gerrit optionally surfaces links to HTML pages in a code browser. The
 links are configurable, and Gerrit comes with a built-in code browser,
 called Gitiles.

 Gerrit integrates with some types of corporate single-sign-on (SSO)
 solutions, typically by having the SSO authentication be performed
 in a reverse proxy web server and then blindly trusting that all
 incoming connections have been authenticated by that reverse proxy.
 When configured to use this form of authentication, Gerrit does
 not integrate with OpenID providers.

 When installing Gerrit, administrators may optionally include an
 HTML header or footer snippet which may include user tracking code,
 such as that used by Google Analytics.  This is a per-instance
 configuration that must be done by hand, and is not supported
 out of the box.  Other site trackers instead of Google Analytics
 can be used, as the administrator can supply any HTML/JavaScript
 they choose.

 Gerrit does not integrate with any Google service, nor with any
 service other than those listed above.

 Plugins (see above) can be used to drive product integrations from the
 Gerrit side. Products that support Gerrit explicitly can use the REST
 API or the SSH API to contact Gerrit.
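
 As a hedged illustration of what a REST client has to do, the sketch
 below decodes a response body. It relies on Gerrit's REST API
 prefixing JSON responses with the line `)]}'` as a guard against
 cross-site script inclusion; the helper name and the sample payload
 are hypothetical.

```python
import json
from typing import Any

XSSI_PREFIX = ")]}'"


def parse_gerrit_json(body: str) -> Any:
    """Strip the anti-XSSI prefix line, if present, and decode the JSON."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


# Hypothetical response body, for illustration only.
sample = ")]}'\n{\"status\": \"NEW\", \"subject\": \"Fix typo\"}"
print(parse_gerrit_json(sample)["status"])
```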


 == Privacy Considerations

 Gerrit stores the following information per user account:

 * Full Name
 * Preferred Email Address

 The full name and preferred email address fields are shown to any
 site visitor viewing a page containing a change uploaded by the
 account owner, or containing a published comment written by the
 account owner.

 Showing the full name and preferred email is approximately the same
 risk as the `From` header of an email posted to a public mailing
 list that maintains archives, and Gerrit treats these fields in
 much the same way that a mailing list archive might handle them.
 Users who don't want to expose this information should either not
 participate in a Gerrit based online community, or open a new email
 address dedicated for this use.

 As the Gerrit UI data is only available through XSRF protected
 JSON-RPC calls, "screen-scraping" for email addresses is difficult,
 but not impossible.  It is unlikely a spammer will go through the
 effort required to code a custom scraping application necessary
 to cull email addresses from published Gerrit comments.  In most
 cases these same addresses would be more easily obtained from the
 project's mailing list archives.

 The user's name and email address are stored unencrypted in the
 link:config-accounts.html#all-users[All-Users] repository.

 == Spam and Abuse Considerations

 There is no spam protection for the Git protocol upload path.
 Uploading a change successfully requires a pre-existing account, and a
 lot of up-front effort.

 Gerrit makes no attempt to detect spam changes or comments in the web
 UI. To post and publish a comment a client must sign in and then use
 the XSRF protected JSON-RPC interface to publish the draft on an
 existing change record.

 The absence of spam handling is based on the idea that Gerrit caters
 to a niche audience, and will therefore be unattractive to spammers. In
 addition, spam is not a factor for corporate, on-premise deployments.


 == Scalability

 Gerrit supports the Git wire protocol, and an API (one API for HTTP,
 and one for SSH).

 The git wire protocol does a client/server negotiation to avoid
 sending too much data. This negotiation occupies a CPU, so the number
 of concurrent push/fetch operations should be capped by the number of
 CPUs.
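
 One way to picture such a cap is a counting semaphore sized to the
 number of CPUs. The sketch below is an illustration of the idea only,
 not how Gerrit implements it; the class name `NegotiationLimiter` is
 hypothetical.

```python
import os
import threading
from typing import Optional


class NegotiationLimiter:
    """Admit at most `slots` concurrent push/fetch negotiations."""

    def __init__(self, slots: Optional[int] = None) -> None:
        # Default to one slot per CPU; fall back to 1 if the count is unknown.
        self.slots = slots or os.cpu_count() or 1
        self._sem = threading.BoundedSemaphore(self.slots)

    def try_acquire(self) -> bool:
        """Take a slot without blocking; False means the server is saturated."""
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        """Return a slot once the negotiation has finished."""
        self._sem.release()


limiter = NegotiationLimiter(slots=2)
print(limiter.try_acquire(), limiter.try_acquire(), limiter.try_acquire())
```

 A request that fails to get a slot could be queued or rejected; which
 of the two is appropriate depends on the deployment.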

 Clients on slow network connections may be network bound rather than
 server side CPU bound, in which case a core may be effectively shared
 with another user. Possible core sharing due to network bottlenecks
 generally holds true for network connections running below 10 MiB/sec.

 Deployments for large, distributed companies can replicate Git data to
 read-only replicas to offload fetch traffic. The read-only replicas
 should also serve this data using Gerrit to ensure that permissions
 are obeyed.

 The API serves requests of varying costs. Requests that originate in
 the UI can block productivity, so care has been taken to optimize
 these for latency, using the following techniques:

 * Async calls: the UI becomes responsive before some UI elements
   have finished loading

 * Caching: metadata is stored in Git, which is relatively expensive to
   access. This is sped up by multiple caches. Because metadata entities
   live in Git, they can be seen as immutable values keyed by SHA-1,
   which makes them very amenable to caching. All SHA-1 keyed caches
   can be persisted on local disk.

   The size (memory, disk) of these caches should be adapted to the
   instance size (number of users, size and quantity of repositories)
   for optimal performance.
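
 Why SHA-1 keyed values cache so well can be shown with a minimal
 sketch: the value behind a given SHA-1 never changes, so an entry never
 needs invalidation, only eviction for space. The class below is an
 illustration with hypothetical names, and it omits size limits and
 disk persistence.

```python
from typing import Callable, Dict


class Sha1Cache:
    """Cache of immutable values keyed by SHA-1; entries never go stale."""

    def __init__(self, load: Callable[[str], bytes]) -> None:
        self._load = load
        self._values: Dict[str, bytes] = {}
        self.misses = 0

    def get(self, sha1: str) -> bytes:
        if sha1 not in self._values:
            # Only a miss pays for the expensive read from Git.
            self.misses += 1
            self._values[sha1] = self._load(sha1)
        return self._values[sha1]


# Hypothetical loader standing in for an expensive Git read.
cache = Sha1Cache(lambda sha1: ("object " + sha1).encode())
cache.get("abc123")
cache.get("abc123")
print(cache.misses)
```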

 Git does not impose fundamental limits (eg. number of files per
 change) on data. To ensure stability, Gerrit configures a number of
 default limits for these.

 // add a link to the default settings.

 === Scaling team size

 A team of size N has N^2 possible interactions. As a result, features
 that expose interactions with the activities of other team members have
 a quadratic cost in aggregate. The following features scale poorly with
 large team sizes:

 * the change screen shows conflicting changes by default. This data is
   cached, but updates to pending changes cause cache misses. For a
   single change, the amount of work is proportional to the number of
   pending changes, so in aggregate, the cost of this feature is
   quadratic in the team size.

 * the change screen shows if a change is mergeable to the target
   branch. If the target branch moves quickly (large developer team),
   this causes cache misses. In aggregate, the cost of this feature is
   also quadratic.

 Both features should be turned off for repositories that involve
 thousands of developers.
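
 The quadratic growth can be made concrete with a back-of-the-envelope
 calculation. The sketch assumes every developer has the same number of
 pending changes and that each pending change is checked against every
 other one; both are simplifications, and the function name is
 hypothetical.

```python
def conflict_checks(team_size: int, pending_per_dev: int = 1) -> int:
    """Pairwise checks needed when every pending change meets every other."""
    pending = team_size * pending_per_dev
    return pending * (pending - 1)


for team in (10, 100, 1000):
    print(team, conflict_checks(team))
```

 Growing the team by a factor of ten multiplies the aggregate work by
 roughly a hundred.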

 === Browser performance

 // say something about browser performance tuning.

 === Real life numbers


 Gerrit is designed for very large projects, both open source and
 proprietary commercial projects. For a single Gerrit process, the
 following limits are known to work:

 .Observed maximums
 [options="header"]
 |======================================================
 |Parameter        |         Maximum | Deployment
 |Projects         |         50,000  | gerrithub.io
 |Contributors     |        150,000  | eclipse.org
 |Bytes/repo       |        100G     | Qualcomm internal
 |Changes/repo     |        300k     | Qualcomm internal
 |Revisions/Change |        300      | Qualcomm internal
 |Reviewers/Change |        87       | Qualcomm internal
 |======================================================


 // find some numbers for these stats:
 // |Files/repo       |        ? |
 // |Files/Change     |        ? |
 // |Comments/Change  |        ? |
 // |max QPS/CPU      |        ? |


 Google runs a horizontally scaled deployment. We have seen the
 following per-JVM maximums:

 .Observed maximums (googlesource.com)
 [options="header"]
 |======================================================
 |Parameter        |         Maximum | Deployment
 |Files/repo       |        500,000  | chromium-review
 |Bytes/repo       |         12G     | chromium-review
 |Changes/repo     |          500k   | chromium-review
 |Revisions/Change |          1900   | chromium-review
 |Files/Change     |           10,000| android-review
 |Comments/Change  |           1,200 | chromium-review
 |======================================================


 == Redundancy & Reliability

 Gerrit is structured as a single JVM process, reading and writing to a
 single file system. If there are hardware failures in the machine
 running the JVM, or the storage holding the repositories, there is no
 recourse; on failure, errors will be returned to the client.

 Deployments needing more stringent uptime guarantees can use a
 replication or multi-master setup, which ensures availability and
 geographical distribution, at the cost of slower write actions.

 // TODO: link.

 === Backups

 Using the standard replication plugin, Gerrit can be configured
 to replicate changes made to the local Git repositories over any
 standard Git transport. After the plugin is installed, remote
 destinations can be configured in `'$site_path'/etc/replication.conf`
 to send copies of all changes over SSH to other servers, or to the
 Amazon S3 blob storage service.


 == Logging Plan

 Gerrit stores Apache style HTTPD logs, as well as ERROR/INFO messages
 from the Java logger, under `$site_dir/logs/`.

 Published comments contain a publication date, so users can judge
 when the comment was posted and decide if it was "recent" or not.
 Only the timestamp is stored in the database; the IP address of
 the comment author is not stored.

 Changes uploaded over the SSH daemon from `git push` have the
 standard Git reflog updated with the date and time that the upload
 occurred, and the Gerrit account identity of who did the upload.
 Changes submitted and merged into a branch also update the
 Git reflog.  These logs are available only to the Gerrit site
 administrator, and they are not replicated through the automatic
 replication noted earlier.  These logs are primarily recorded for an
 "oh s**t" moment where the administrator has to rewind data.  In most
 installations they are a waste of disk space.  Future versions of
 JGit may allow disabling these logs, and Gerrit may take advantage
 of that feature to stop writing these logs.

 A web server positioned in front of Gerrit (such as a reverse proxy)
 or the hosting servlet container may record access logs, and these
 logs may be mined for usage information.  This is outside of the
 scope of Gerrit.


 GERRIT
 ------
 Part of link:index.html[Gerrit Code Review]

 SEARCHBOX
 ---------