:linkattrs:
= Gerrit Code Review - System Design
== Objective
Gerrit is a web-based code review system, facilitating online code
reviews for projects using the Git version control system.
Gerrit makes reviews easier by showing changes in a side-by-side
display, and allowing inline/file comments to be added by any reviewer.
Gerrit simplifies Git-based project maintainership by permitting
any authorized user to submit changes to the master Git repository,
rather than requiring all approved changes to be merged in by
hand by the project maintainer. This functionality enables a more
centralized usage of Git.
== Background
Git is a distributed version control system, wherein each repository
is assumed to be owned/maintained by a single user. There are no
inherent security controls built into Git, so the ability to read
from or write to a repository is controlled entirely by the host's
filesystem or network access controls.
The objective of Gerrit is to facilitate Git development by larger
teams: it provides a means to enforce organizational policies around
code submissions, e.g. "all code must be reviewed by another
developer" or "all code shall pass tests". It achieves this by:
* providing fine-grained (per-branch, per-repository, inheriting)
access controls, which allow a Gerrit admin to delegate permissions
to different teams or team leads.
* facilitating code review: Gerrit offers a web view of pending code
changes that allows for easy reading and commenting by humans. The
web view can offer data coming out of automated QA processes (e.g.
CI). The permission system also includes fine-grained control of who
can approve pending changes for submission, to further facilitate
delegation of code ownership.
== Overview
Developers create one or more changes on their local desktop system,
then upload them for review to Gerrit using the standard `git push`
command line program, or any GUI which can invoke `git push` on behalf
of the user. Authentication and data transfer are handled through SSH
and HTTPS. Uploads are protected by the authentication,
confidentiality and integrity offered by the transport (SSH, HTTPS).
Each Git commit created on the client desktop system is converted into
a unique change record which can be reviewed independently.
A summary of each newly uploaded change is automatically emailed
to reviewers, so they receive a direct hyperlink to review the
change on the web. Reviewer email addresses can be specified on the
`git push` command line, but typically reviewers are added in the web
interface.
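As a rough illustration of the upload step, the sketch below assembles a `git push` command line that targets Gerrit's `refs/for/<branch>` review namespace and names reviewers with the `r=` push option. The remote name, branch, and email addresses are placeholders, and the helper `build_review_push` is hypothetical, not part of Gerrit.

```python
def build_review_push(remote, branch, reviewers=()):
    """Return the argv for a `git push` that uploads HEAD for review.

    `build_review_push` is an illustrative helper, not a Gerrit API.
    """
    target = f"HEAD:refs/for/{branch}"
    if reviewers:
        # Push options follow '%' and are separated by commas.
        target += "%" + ",".join(f"r={email}" for email in reviewers)
    return ["git", "push", remote, target]


# Example with placeholder values.
print(" ".join(build_review_push("origin", "master", ["alice@example.com"])))
```

The resulting list can be handed to `subprocess.run` by a GUI or script acting on behalf of the user.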
Reviewers use the web interface to read the side-by-side or unified
diff of a change, and insert draft inline/file comments where
appropriate. A draft comment is visible only to the reviewer, until
they publish those comments. Published comments are automatically
emailed to the change author by Gerrit, and are CC'd to all other
reviewers who have already commented on the change.
Reviewers can score the change ("vote"), indicating whether they feel the
change is ready for inclusion in the project, needs more work, or
should be rejected outright. These scores provide direct feedback to
Gerrit's change submit function.
After a change has been scored positively by reviewers, Gerrit enables
a submit button on the web interface. Authorized users can push the
submit button to have the change enter the project repository. The
user pressing the submit button does not need to be the author of the
change.
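How scores feed the submit function can be sketched as follows. This is a simplified model, assuming a single hypothetical label on which the maximum score approves and the minimum score vetoes; actual deployments configure their own labels and submit requirements, so the thresholds here are assumptions.

```python
def is_submittable(votes, max_score=2, min_score=-2):
    """Decide whether the submit button would be enabled.

    votes: mapping of reviewer name -> integer score on one label.
    The score range (-2..+2) is an assumption for illustration.
    """
    scores = list(votes.values())
    has_approval = any(score == max_score for score in scores)
    has_veto = any(score == min_score for score in scores)
    return has_approval and not has_veto


print(is_submittable({"alice": 2, "bob": 1}))   # approved, no veto
print(is_submittable({"alice": 2, "bob": -2}))  # vetoed
```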
== Infrastructure
End-user web browsers make HTTP requests directly to Gerrit's
HTTP server. As nearly all of the Gerrit user interface is implemented
in a JavaScript based web app, the majority of these requests are
transmitting compressed JSON payloads, with all HTML being generated
within the browser.
Gerrit's HTTP server side component is implemented as a standard Java
servlet, and thus runs within any link:install-j2ee.html[J2EE servlet
container]. The standard install will run inside Jetty, which is
included in the binary.
End-user uploads are performed over SSH or HTTP, so Gerrit's servlets
also start up a background thread to receive SSH connections through
an independent SSH port. SSH clients communicate directly with this
port, bypassing the HTTP server used by browsers.
User authentication is handled by identity realms. Gerrit supports the
following types of authentication:
* OpenID (see link:http://openid.net/developers/specs/[OpenID Specifications,role=external,window=_blank])
* OAuth2
* LDAP
* Google accounts (on googlesource.com)
* SAML
* Kerberos
* 3rd party SSO
=== NoteDb
Server side data storage for Gerrit is broken down into two different
categories:
* Git repository data
* Gerrit metadata
The Git repository data is the Git object database used to store
already submitted revisions, as well as all uploaded (proposed)
changes. Gerrit uses the standard Git repository format, and
therefore requires direct filesystem access to the repositories.
All repository data is stored in the filesystem and accessed through
the JGit library. Repository data can be stored on remote servers
accessible through NFS or SMB, but the remote directory must
be mounted on the Gerrit server as part of the local filesystem
namespace. Remote filesystems are likely to perform worse than
local ones, due to Git disk IO behavior not being optimized for
remote access.
The Gerrit metadata contains a summary of the available changes, all
comments (published and drafts), and individual user account
information.
Gerrit metadata is also stored in Git, with the commits marking the
historical state of metadata. Data is stored in the trees associated
with the commits, typically using Git config file or JSON as the base
format. For metadata, there are 3 types of data: changes, accounts and
groups.
Accounts are stored in a special Git repository `All-Users`.
Accounts can be grouped in groups. Gerrit has a built-in group system,
but can also interface to external group systems (e.g. Google groups,
LDAP). The built-in groups are stored in `All-Users`.
Draft comments are stored in `All-Users` too.
Permissions are stored in Git, in a branch `refs/meta/config` for the
repository. Repository configuration (including permissions) supports
single inheritance, with the `All-Projects` repository containing
site-wide defaults.
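The single-inheritance lookup can be pictured as a walk up the parent chain until a value is found. The sketch below is a deliberately simplified first-match model with hypothetical project names and keys; Gerrit's real handling of access rules merges sections and is more involved.

```python
def resolve_setting(project, key, parents, configs):
    """Walk the single-inheritance chain until some project defines `key`.

    parents: mapping project -> parent project (None at the root).
    configs: mapping project -> dict of settings defined in that project.
    Both structures are illustrative stand-ins for refs/meta/config data.
    """
    seen = set()
    current = project
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles
        value = configs.get(current, {}).get(key)
        if value is not None:
            return value
        current = parents.get(current)
    return None


parents = {"tools/build": "tools", "tools": "All-Projects", "All-Projects": None}
configs = {
    "All-Projects": {"requireChangeId": "true", "maxObjectSize": "10m"},
    "tools": {"maxObjectSize": "50m"},
}
print(resolve_setting("tools/build", "maxObjectSize", parents, configs))
```

Here `tools/build` defines nothing itself, so the value comes from its parent `tools`, and anything `tools` leaves unset falls back to `All-Projects`.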
Code review metadata (change status, votes, and comments) is stored
in NoteDb, in the same Git repository as the submitted code and the
code under review. Hence, the review history can be exported with `git
clone --mirror` by anyone with sufficient permissions.
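For orientation, a mirrored repository exposes each change under the `refs/changes/` namespace, sharded by the last two digits of the change number, with one ref per patch set and a `meta` ref for the review metadata. The helpers below are a small sketch of that naming scheme; the function names are hypothetical.

```python
def change_ref_prefix(change_number):
    """Return the sharded ref prefix for a change (helper name is illustrative)."""
    shard = f"{change_number % 100:02d}"  # last two digits, zero padded
    return f"refs/changes/{shard}/{change_number}"


def patch_set_ref(change_number, patch_set):
    """Ref holding the commit of one patch set of the change."""
    return f"{change_ref_prefix(change_number)}/{patch_set}"


def meta_ref(change_number):
    """Ref holding the NoteDb review metadata of the change."""
    return f"{change_ref_prefix(change_number)}/meta"


print(patch_set_ref(1234, 2))
print(meta_ref(7))
```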
== Permissions
Permissions are specified on branch names, and given to groups. For
example, the following rule grants Release-Engineers push permission
for stable branches:
```
[access "refs/heads/stable/*"]
push = group Release-Engineers
```
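To illustrate how a pattern such as `refs/heads/stable/*` selects branches, the sketch below matches refs against exact names and trailing `/*` wildcards. It is a simplification under stated assumptions: regular-expression patterns and rule precedence are ignored, and the function names and the second rule are hypothetical.

```python
def ref_matches(pattern, ref):
    """Match `ref` against an exact name or a trailing '/*' wildcard."""
    if pattern.endswith("/*"):
        # Keep the trailing slash so 'stable/*' does not match 'stable-old'.
        return ref.startswith(pattern[:-1])
    return ref == pattern


def groups_with_push(rules, ref):
    """rules: list of (ref_pattern, group_name) pairs granting push."""
    return sorted({group for pattern, group in rules if ref_matches(pattern, ref)})


rules = [
    ("refs/heads/stable/*", "Release-Engineers"),
    ("refs/heads/*", "Developers"),  # hypothetical extra rule
]
print(groups_with_push(rules, "refs/heads/stable/3.9"))
```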
There are fundamentally two types of permissions:
* Write permissions (who can vote, push, submit etc.)
* Read permissions (who can see data)
Read permissions need special treatment across Gerrit, because Gerrit
should only surface data (including repository existence) if a user
has read permission. This means that
* The git wire protocol support must omit references from
advertisement if the user lacks read permissions.
* Uploads through the git wire protocol must refuse commits that are
based on SHA-1s for data that the user can't see.
* Tags are only visible if their commits are visible to the user
through a non-tag reference.
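The first and last rules can be sketched as a filter over the ref advertisement. This is a minimal model, not Gerrit's implementation: it treats a tag as visible only when its commit is the tip of a readable non-tag ref, whereas a real check would consider reachability through history. The function name and sample refs are hypothetical.

```python
def advertised_refs(refs, readable):
    """Filter a ref advertisement by read permission.

    refs: mapping ref name -> commit id.
    readable: predicate telling whether the user may read a non-tag ref.
    """
    visible = {
        name: sha
        for name, sha in refs.items()
        if not name.startswith("refs/tags/") and readable(name)
    }
    visible_commits = set(visible.values())
    for name, sha in refs.items():
        # Simplification: tip equality stands in for reachability.
        if name.startswith("refs/tags/") and sha in visible_commits:
            visible[name] = sha
    return visible


refs = {
    "refs/heads/master": "a1",
    "refs/heads/secret": "b2",
    "refs/tags/v1": "a1",
    "refs/tags/v2": "b2",
}
print(sorted(advertised_refs(refs, lambda name: name != "refs/heads/secret")))
```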
Metadata (e.g. OAuth credentials) is also stored in Git. Existing
endpoints must refuse creating branches or changes that expose this
metadata or allow it to be modified.
=== Indexing
Almost all data is stored in Git, but Git only supports fast lookup by
SHA-1 or by ref (branch) name. Therefore Gerrit also has an indexing
system (powered by Lucene by default) for other types of queries.
There are 4 indices:
* Project index - find repositories by name, parent project, etc.
* Account index - find accounts by name, email, etc.
* Group index - find groups by name, owner, description etc.
* Change index - find changes by file, status, modification date etc.
The base entities are characterized by SHA-1s. Storing the
characterizing SHA-1s allows detection of stale index entries.
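Staleness detection then reduces to comparing the SHA-1 recorded in an index entry with the SHA-1 currently in Git. The sketch below assumes both are available as plain dictionaries keyed by entity id; the function name and data layout are hypothetical.

```python
def stale_entries(index, current_sha1s):
    """Return ids of index entries whose stored SHA-1 no longer matches Git.

    index: mapping entity id -> SHA-1 recorded when the entry was indexed.
    current_sha1s: mapping entity id -> SHA-1 of the entity in Git now.
    An entity missing from Git is reported as stale as well.
    """
    return sorted(
        entity_id
        for entity_id, indexed_sha1 in index.items()
        if current_sha1s.get(entity_id) != indexed_sha1
    )


index = {101: "aaa", 102: "bbb", 103: "ccc"}
current = {101: "aaa", 102: "bbd"}
print(stale_entries(index, current))
```

Entries reported this way would be candidates for reindexing.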
== Plug-in architecture
Gerrit has a plug-in architecture. Plugins can be installed by
dropping them into `$site_directory/plugins`, or at runtime through
plugin SSH commands, or the plugin REST API.
=== Backend plugins
At runtime, code can be loaded from a `.jar` file. This code can hook
into predefined extension points. A common use of plugins is to have
Gerrit interoperate with site-specific tools, such as CI-systems or
issue trackers.
// list some notable extension points, and notable plugins
// link to plugin development
Some backend plugins expose the JVM for scripting use (e.g. Groovy,
Scala), so plugins can be written without having to set up a Java
development environment.
// Luca to expand: how do script plugins load their scripts?
=== Frontend plugins
The UI can be extended using frontend plugins. This is useful for
changing the look & feel of Gerrit, but it can also be used to surface
data from systems that aren't integrated with the Gerrit backend, e.g.
CI systems or code coverage providers.
// FE team to write a bit more:
// * how to load ?
// * XSRF, CORS ?
== Internationalization and Localization
As a source code review system for open source projects, where the
commonly preferred language for communication is typically English,
Gerrit does not make internationalization or localization a priority.
The majority of Gerrit's users will be writing change descriptions
and comments in English, and therefore an English user interface
is usable by the target user base.
== Accessibility Considerations
// UI team to rewrite this.
Whenever possible Gerrit displays raw text rather than image icons,
so screen readers should still be able to provide useful information
to blind persons accessing Gerrit sites.
Standard HTML hyperlinks are used rather than HTML div or span tags
with click listeners. This provides two benefits to the end-user.
The first benefit is that screen readers are optimized for locating
standard hyperlink anchors and presenting them to the end-user as
a navigation action. The second benefit is that users can use
the 'open in new tab/window' feature of their browser whenever
they choose.
When possible, Gerrit uses the ARIA properties on DOM widgets to
provide hints to screen readers.
== Browser Compatibility
Gerrit requires a JavaScript enabled browser.
// UI team to add section on minimum browser requirements.
As Gerrit is a pure JavaScript application on the client side, with
no server side rendering fallbacks, the browser must support modern
JavaScript semantics in order to access the Gerrit web application.
Dumb clients such as `lynx`, `wget`, `curl`, or even many search engine
spiders are not able to access Gerrit content.
All of the content stored within Gerrit is also available through
other means, such as gitweb or the `git://` protocol. Any existing
search engine crawlers can index the server-side HTML served by a code
browser, and thus can index the majority of the changes which might
appear in Gerrit. Therefore the lack of support for most search engine
crawlers is a non-issue for most Gerrit deployments.
== Product Integration
Gerrit optionally surfaces links to HTML pages in a code browser. The
links are configurable, and Gerrit comes with a built-in code browser,
called Gitiles.
Gerrit integrates with some types of corporate single-sign-on (SSO)
solutions, typically by having the SSO authentication be performed
in a reverse proxy web server and then blindly trusting that all
incoming connections have been authenticated by that reverse proxy.
When configured to use this form of authentication, Gerrit does
not integrate with OpenID providers.
When installing Gerrit, administrators may optionally include an
HTML header or footer snippet which may include user tracking code,
such as that used by Google Analytics. This is a per-instance
configuration that must be done by hand, and is not supported
out of the box. Other site trackers instead of Google Analytics
can be used, as the administrator can supply any HTML/JavaScript
they choose.
Gerrit does not integrate with any Google service, or any other
services other than those listed above.
Plugins (see above) can be used to drive product integrations from the
Gerrit side. Products that support Gerrit explicitly can use the REST
API or the SSH API to contact Gerrit.
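A product calling the REST API has to account for one detail of the response format: Gerrit's REST documentation describes JSON bodies as being prefixed with the line `)]}'` to guard against cross-site script inclusion. The sketch below strips that prefix before decoding; the function name and the sample payload are hypothetical.

```python
import json

# Guard prefix described in Gerrit's REST API documentation.
XSSI_PREFIX = ")]}'"


def parse_gerrit_json(body):
    """Strip the guard prefix, if present, and decode the JSON payload."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


sample = ")]}'\n{\"subject\": \"Fix typo\", \"status\": \"NEW\"}"
print(parse_gerrit_json(sample)["status"])
```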
== Privacy Considerations
Gerrit stores the following information per user account:
* Full Name
* Preferred Email Address
The full name and preferred email address fields are shown to any
site visitor viewing a page containing a change uploaded by the
account owner, or containing a published comment written by the
account owner.
Showing the full name and preferred email is approximately the same
risk as the `From` header of an email posted to a public mailing
list that maintains archives, and Gerrit treats these fields in
much the same way that a mailing list archive might handle them.
Users who don't want to expose this information should either not
participate in a Gerrit based online community, or open a new email
address dedicated for this use.
As the Gerrit UI data is only available through XSRF protected
JSON-RPC calls, "screen-scraping" for email addresses is difficult,
but not impossible. It is unlikely a spammer will go through the
effort required to code a custom scraping application necessary
to cull email addresses from published Gerrit comments. In most
cases these same addresses would be more easily obtained from the
project's mailing list archives.
The user's name and email address are stored unencrypted in the
link:config-accounts.html#all-users[All-Users] repository.
== Spam and Abuse Considerations
There is no spam protection for the Git protocol upload path.
Uploading a change successfully requires a pre-existing account, and a
lot of up-front effort.
Gerrit makes no attempt to detect spam changes or comments in the web
UI. To post and publish a comment a client must sign in and then use
the XSRF protected JSON-RPC interface to publish the draft on an
existing change record.
Absence of spam handling is based upon the idea that Gerrit caters to
a niche audience, and will therefore be unattractive to spammers. In
addition, it is not a factor for corporate, on-premise deployments.
== Scalability
Gerrit supports the Git wire protocol, and an API (one API for HTTP,
and one for SSH).
The git wire protocol does a client/server negotiation to avoid
sending too much data. This negotiation occupies a CPU, so the number
of concurrent push/fetch operations should be capped by the number of
CPUs.
Clients on slow network connections may be network bound rather than
server side CPU bound, in which case a core may be effectively shared
with another user. Possible core sharing due to network bottlenecks
generally holds true for network connections running below 10 MiB/sec.
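The capping idea can be illustrated with a semaphore sized to the CPU count. This is a generic sketch, not how Gerrit's thread pools are configured; the `OperationLimiter` class is hypothetical.

```python
import os
import threading


class OperationLimiter:
    """Admit at most `limit` concurrent operations; the rest wait their turn."""

    def __init__(self, limit=None):
        # Default to one slot per CPU, as the text above suggests.
        self.limit = limit or os.cpu_count() or 1
        self._slots = threading.BoundedSemaphore(self.limit)

    def run(self, operation):
        """Run `operation` once a slot is free and return its result."""
        with self._slots:
            return operation()


limiter = OperationLimiter()
print(limiter.run(lambda: "negotiated"))
```

Raising the limit above the CPU count would model the core sharing described for network-bound clients.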
Deployments for large, distributed companies can replicate Git data to
read-only replicas to offload fetch traffic. The read-only replicas
should also serve this data using Gerrit to ensure that permissions
are obeyed.
The API serves requests of varying costs. Requests that originate in
the UI can block productivity, so care has been taken to optimize
these for latency, using the following techniques:
* Async calls: the UI becomes responsive before some UI elements
have finished loading
* Caching: metadata is stored in Git, which is relatively expensive to
access. This is sped up by multiple caches. Metadata entities are
stored in Git, and can therefore be seen as immutable values keyed
by SHA-1, which is very amenable to caching. All SHA-1 keyed caches
can be persisted on local disk.
The size (memory, disk) of these caches should be adapted to the
instance size (number of users, size and quantity of repositories)
for optimal performance.
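Why SHA-1 keyed values cache so well can be shown with a small memoization sketch: because a given SHA-1 always names the same content, a cached result never needs invalidation, only eviction. The `ObjectStore` class below is a hypothetical stand-in for a Git repository and hashes raw content rather than using Git's real object encoding.

```python
import functools
import hashlib


class ObjectStore:
    """Toy content-addressed store that counts how often it is read."""

    def __init__(self):
        self._objects = {}
        self.reads = 0

    def put(self, content):
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def read(self, sha1):
        self.reads += 1  # stands in for an expensive Git access
        return self._objects[sha1]


def make_cached_parser(store, maxsize=1024):
    """Return a parser whose results are cached by SHA-1."""

    @functools.lru_cache(maxsize=maxsize)
    def parse(sha1):
        # Return a tuple so the cached value cannot be mutated by callers.
        return tuple(store.read(sha1).decode("utf-8").splitlines())

    return parse


store = ObjectStore()
key = store.put(b"[label]\nvalue = 1\n")
parse = make_cached_parser(store)
parse(key)
parse(key)
print(store.reads)
```

The `maxsize` parameter plays the role of the cache size that should be adapted to the instance.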
Git does not impose fundamental limits (e.g. number of files per
change) on data. To ensure stability, Gerrit configures a number of
default limits.
// add a link to the default settings.
=== Scaling team size
A team of size N has N^2 possible interactions. As a result, features
that expose interactions with activities of other team members have a
quadratic cost in aggregate. The following features scale poorly with
large team sizes:
* the change screen shows conflicting changes by default. This data is
cached, but updates to pending changes cause cache misses. For a
single change, the amount of work is proportional to the number of
pending changes, so in aggregate, the cost of this feature is
quadratic in the team size.
* the change screen shows if a change is mergeable to the target
branch. If the target branch moves quickly (large developer team),
this causes cache misses. In aggregate, the cost of this feature is
also quadratic.
Both features should be turned off for repositories that involve 1000s
of developers.
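The quadratic growth is easy to see with a back-of-the-envelope count. The sketch below assumes, purely for illustration, that each developer has a fixed number of pending changes and that every pending change is checked against every other one.

```python
def aggregate_conflict_checks(team_size, pending_per_developer=1):
    """Count pairwise checks if each pending change is compared with all others."""
    pending = team_size * pending_per_developer
    return pending * (pending - 1)


for size in (10, 1000):
    print(size, aggregate_conflict_checks(size))
```

Under these assumptions, growing a team from 10 to 1000 developers raises the number of checks from 90 to 999,000, a factor of about 11,000 for a hundredfold increase in team size.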
=== Browser performance
// say something about browser performance tuning.
=== Real life numbers
Gerrit is designed for very large projects, both open source and
proprietary commercial projects. For a single Gerrit process, the
following limits are known to work:
.Observed maximums
[options="header"]
|======================================================
|Parameter | Maximum | Deployment
|Projects | 50,000 | gerrithub.io
|Contributors | 150,000 | eclipse.org
|Bytes/repo | 100G | Qualcomm internal
|Changes/repo | 300k | Qualcomm internal
|Revisions/Change | 300 | Qualcomm internal
|Reviewers/Change | 87 | Qualcomm internal
|======================================================
// find some numbers for these stats:
// |Files/repo | ? |
// |Files/Change | ? |
// |Comments/Change | ? |
// |max QPS/CPU | ? |
Google runs a horizontally scaled deployment. We have seen the
following per-JVM maximums:
.Observed maximums (googlesource.com)
[options="header"]
|======================================================
|Parameter | Maximum | Deployment
|Files/repo | 500,000 | chromium-review
|Bytes/repo | 12G | chromium-review
|Changes/repo | 500k | chromium-review
|Revisions/Change | 1900 | chromium-review
|Files/Change | 10,000 | android-review
|Comments/Change | 1,200 | chromium-review
|======================================================
== Redundancy & Reliability
Gerrit is structured as a single JVM process, reading and writing to a
single file system. If there are hardware failures in the machine
running the JVM, or the storage holding the repositories, there is no
recourse; on failure, errors will be returned to the client.
Deployments needing more stringent uptime guarantees can use
replication/multi-master setup, which ensures availability and
geographical distribution, at the cost of slower write actions.
// TODO: link.
=== Backups
Using the standard replication plugin, Gerrit can be configured
to replicate changes made to the local Git repositories over any
standard Git transports. After the plugin is installed, remote
destinations can be configured in `'$site_path'/etc/replication.conf`
to send copies of all changes over SSH to other servers, or to the
Amazon S3 blob storage service.
== Logging Plan
Gerrit stores Apache style HTTPD logs, as well as ERROR/INFO messages
from the Java logger, under `$site_dir/logs/`.
Published comments contain a publication date, so users can judge
when the comment was posted and decide if it was "recent" or not.
Only the timestamp is stored in the database; the IP address of
the comment author is not stored.
Changes uploaded over the SSH daemon from `git push` have the
standard Git reflog updated with the date and time that the upload
occurred, and the Gerrit account identity of who did the upload.
Changes submitted and merged into a branch also update the
Git reflog. These logs are available only to the Gerrit site
administrator, and they are not replicated through the automatic
replication noted earlier. These logs are primarily recorded for an
"oh s**t" moment where the administrator has to rewind data. In most
installations they are a waste of disk space. Future versions of
JGit may allow disabling these logs, and Gerrit may take advantage
of that feature to stop writing these logs.
A web server positioned in front of Gerrit (such as a reverse proxy)
or the hosting servlet container may record access logs, and these
logs may be mined for usage information. This is outside of the
scope of Gerrit.
GERRIT
------
Part of link:index.html[Gerrit Code Review]
SEARCHBOX
---------