
 Gerrit Code Review - System Design
 ==================================

 Objective
 ---------

 Gerrit is a web based code review system, facilitating online code
 reviews for projects using the Git version control system.

 Gerrit makes reviews easier by showing changes in a side-by-side
 display, and allowing inline comments to be added by any reviewer.

 Gerrit simplifies Git based project maintainership by permitting
 any authorized user to submit changes to the master Git repository,
 rather than requiring all approved changes to be merged in by
 hand by the project maintainer.  This functionality enables a more
 centralized usage of Git.


 Background
 ----------

 Google developed Mondrian, a Perforce based code review tool to
 facilitate peer-review of changes prior to submission to the central
 code repository.  Mondrian is not open source, as it is tied to the
 use of Perforce and to many Google-only services, such as Bigtable.
 Google employees have often described how useful Mondrian and its
 peer-review process are to their day-to-day work.

 Guido van Rossum open sourced portions of Mondrian within Rietveld,
 a similar code review tool running on Google App Engine, but for
 use with Subversion rather than Perforce.  Rietveld is in common
 use by many open source projects, facilitating their peer reviews
 much as Mondrian does for Google employees.  Unlike Mondrian and
 the Google Perforce triggers, Rietveld is strictly advisory and
 does not enforce peer-review prior to submission.

 Git is a distributed version control system, wherein each repository
 is assumed to be owned/maintained by a single user.  There are no
 inherent security controls built into Git, so the ability to read
 from or write to a repository is controlled entirely by the host's
 filesystem access controls.  When multiple maintainers collaborate
 on a single shared repository a high degree of trust is required,
 as any collaborator with write access can alter the repository.

 Gitosis provides tools to secure centralized Git repositories,
 permitting multiple maintainers to manage the same project at once,
 by restricting access to a secure network protocol, much like
 Perforce secures a repository by only permitting access over its
 network port.

 Google founded the Android Open Source Project (AOSP) when it
 released the Android operating system as open source.  AOSP has
 selected Git as its primary version control tool.  As many of the
 engineers have a background of working with Mondrian at Google,
 there is a strong desire to have the same (or better) feature set
 available for Git and AOSP.

 Gerrit Code Review started as a simple set of patches to Rietveld,
 and was originally built to service AOSP. This quickly turned
 into a fork as we added access control features that Guido van
 Rossum did not want to see complicating the Rietveld code base. As
 the functionality and code were starting to become drastically
 different, a different name was needed. Gerrit calls back to the
 original namesake of Rietveld, Gerrit Rietveld, a Dutch architect.

 Gerrit 2.x is a complete rewrite of the Gerrit fork, completely
 changing the implementation from Python on Google App Engine, to Java
 on a J2EE servlet container and a SQL database.

 * link:http://video.google.com/videoplay?docid=-8502904076440714866[Mondrian Code Review On The Web]
 * link:http://code.google.com/p/rietveld/[Rietveld - Code Review for Subversion]
 * link:http://eagain.net/gitweb/?p=gitosis.git;a=blob;f=README.rst;hb=HEAD[Gitosis README]
 * link:http://source.android.com/[Android Open Source Project]


 Overview
 --------

 Developers create one or more changes on their local desktop system,
 then upload them for review to Gerrit using the standard `git push`
 command line program, or any GUI which can invoke `git push` on
 behalf of the user.  Authentication and data transfer are handled
 through SSH.  Users are authenticated by username and public/private
 key pair, and all data transfer is protected by the SSH connection
 and Git's own data integrity checks.

 Each Git commit created on the client desktop system is converted
 into a unique change record which can be reviewed independently.
 Change records are stored in PostgreSQL, where they can be queried to
 present customized user dashboards, enumerating any pending changes.

 A summary of each newly uploaded change is automatically emailed
 to reviewers, so they receive a direct hyperlink to review the
 change on the web.  Reviewer email addresses can be specified on the
 `git push` command line, but typically reviewers are automatically
 selected by Gerrit by identifying users who have change approval
 permissions in the project.
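
 The upload step described above can be sketched as follows.  This is
 a minimal illustration, not Gerrit's own tooling: the helper name
 `build_push_command`, the user, host, and project are hypothetical,
 and the port number, the `refs/for/<branch>` destination, and the
 `--reviewer` receive-pack option are assumptions based on Gerrit 2.x
 upload conventions rather than details stated in this document.

```python
import shlex


def build_push_command(user, host, project, branch, reviewers=(), port=29418):
    """Return an argv list that uploads HEAD for review (hypothetical helper)."""
    url = f"ssh://{user}@{host}:{port}/{project}"
    argv = ["git", "push"]
    if reviewers:
        # Assumption: reviewer addresses are handed to the server-side
        # receive-pack program as --reviewer options.
        receive_pack = "git receive-pack " + " ".join(
            f"--reviewer={shlex.quote(r)}" for r in reviewers
        )
        argv.append(f"--receive-pack={receive_pack}")
    # Assumption: pushing to refs/for/<branch> creates a change for review
    # instead of updating the branch directly.
    argv += [url, f"HEAD:refs/for/{branch}"]
    return argv


cmd = build_push_command(
    "alice", "review.example.com", "tools/demo", "master",
    reviewers=["bob@example.com"],
)
# Print a copy-and-paste friendly shell command line.
print(shlex.join(cmd))
```

 Returning an argv list rather than a single string keeps the sketch
 safe to pass to `subprocess.run` without shell quoting concerns.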

 Reviewers use the web interface to read the side-by-side or unified
 diff of a change, and insert draft inline comments where appropriate.
 A draft comment is visible only to the reviewer, until they publish
 those comments.  Published comments are automatically emailed to
 the change author by Gerrit, and are CC'd to all other reviewers
 who have already commented on the change.

 When publishing comments reviewers are also given the opportunity
 to score the change, indicating whether they feel the change is
 ready for inclusion in the project, needs more work, or should be
 rejected outright.  These scores provide direct feedback to Gerrit's
 change submit function.

 After a change has been scored positively by reviewers, Gerrit
 enables a submit button on the web interface.  Authorized users
 can push the submit button to have the change enter the project
 repository.  The equivalent in Subversion or Perforce would be
 that Gerrit is invoking `svn commit` or `p4 submit` on behalf of
 the web user pressing the button.  Due to the way Git audit trails
 are maintained, the user pressing the submit button does not need
 to be the author of the change.


 Infrastructure
 --------------

 End-user web browsers make HTTP requests directly to Gerrit's
 HTTP server.  As nearly all of the user interface is implemented
 through Google Web Toolkit (GWT), the majority of these requests
 are transmitting compressed JSON payloads, with all HTML being
 generated within the browser.  Most responses are under 1 KB.

 Gerrit's HTTP server side component is implemented as a standard
 Java servlet, and thus runs within any J2EE servlet container.
 Popular choices for deployments would be Tomcat or Jetty, as these
 are high-quality open-source servlet containers that are readily
 available for download.

 End-user uploads are performed over SSH, so Gerrit's servlets also
 start up a background thread to receive SSH connections through
 an independent SSH port.  SSH clients communicate directly with
 this port, bypassing the HTTP server used by browsers.

 Server side data storage for Gerrit is broken down into two different
 categories:

 * Git repository data
 * Gerrit metadata

 The Git repository data is the Git object database used to store
 already submitted revisions, as well as all uploaded (proposed)
 changes.  Gerrit uses the standard Git repository format, and
 therefore requires direct filesystem access to the repositories.
 All repository data is stored in the filesystem and accessed through
 the JGit library.  Repository data can be stored on remote servers
 accessible through NFS or SMB, but the remote directory must
 be mounted on the Gerrit server as part of the local filesystem
 namespace.  Remote filesystems are likely to perform worse than
 local ones, due to Git disk IO behavior not being optimized for
 remote access.

 The Gerrit metadata contains a summary of the available changes,
 all comments (published and drafts), and individual user account
 information.  The metadata is housed in a PostgreSQL database,
 which can be located either on the same server as Gerrit, or on
 a different (but nearby) server.  Most installations would opt to
 install both Gerrit and PostgreSQL on the same server, to reduce
 administration overheads.

 User authentication is handled by OpenID, and therefore Gerrit
 requires that the OpenID provider selected by a user be
 online and operating in order to authenticate that user.

 * link:http://code.google.com/webtoolkit/[Google Web Toolkit (GWT)]
 * link:http://www.kernel.org/pub/software/scm/git/docs/gitrepository-layout.html[Git Repository Format]
 * link:http://www.postgresql.org/about/[About PostgreSQL]
 * link:http://openid.net/developers/specs/[OpenID Specifications]


 Project Information
 -------------------

 Gerrit is developed as a self-hosting open source project:

 * link:http://code.google.com/p/gerrit/[Project Homepage]
 * link:http://code.google.com/p/gerrit/downloads/list[Release Versions]
 * link:http://code.google.com/p/gerrit/wiki/Source?tm=4[Source]
 * link:http://code.google.com/p/gerrit/issues/list[Issue Tracking]
 * link:https://review.source.android.com/[Change Review]


 Internationalization and Localization
 -------------------------------------

 As a source code review system for open source projects, where the
 commonly preferred language for communication is typically English,
 Gerrit does not make internationalization or localization a priority.

 The majority of Gerrit's users will be writing change descriptions
 and comments in English, and therefore an English user interface
 is usable by the target user base.

 Gerrit uses GWT's i18n support to externalize all constant strings
 and messages shown to the user, so that in the future someone who
 really needed a translated version of the UI could contribute new
 string files for their locale(s).

 Right-to-left (RTL) support is only barely considered within the
 Gerrit code base.  Some portions of the code have tried to take
 RTL into consideration, while others probably need to be modified
 before translating the UI to an RTL language.

 * link:i18n-readme.html[Gerrit's i18n Support]


 Accessibility Considerations
 ----------------------------

 Whenever possible Gerrit displays raw text rather than image icons,
 so screen readers should still be able to provide useful information
 to blind persons accessing Gerrit sites.

 Standard HTML hyperlinks are used rather than HTML div or span tags
 with click listeners.  This provides two benefits to the end-user.
 The first benefit is that screen readers are optimized for locating
 standard hyperlink anchors and presenting them to the end-user as
 a navigation action.  The second benefit is that users can use
 the 'open in new tab/window' feature of their browser whenever
 they choose.

 When possible, Gerrit uses the ARIA properties on DOM widgets to
 provide hints to screen readers.


 Browser Compatibility
 ---------------------

 Supporting non-JavaScript enabled browsers is a non-goal for Gerrit.

 As Gerrit is a pure-GWT application with no server side rendering
 fallbacks, the browser must support modern JavaScript semantics in
 order to access the Gerrit web application.  Dumb clients such as
 `lynx`, `wget`, `curl`, or even many search engine spiders are not
 able to access Gerrit content.

 As Google Web Toolkit (GWT) is used to generate the browser
 specific versions of the client-side JavaScript code, Gerrit works
 on any JavaScript enabled browser which GWT can produce code for.
 This covers the majority of the popular browsers.

 The Gerrit project does not have the development resources necessary
 to support two parallel UI implementations (GWT based JavaScript
 and server-side rendering).  Consequently only one is implemented.

 There are a number of web browsers available with full JavaScript
 support, and nearly every operating system (including any PDA-like
 mobile phone) comes with one standard.  Users who are committed
 to developing changes for a Gerrit managed project can be expected
 to be able to run a JavaScript enabled browser, as they also would
 need to be running Git in order to contribute.

 There are a number of open source browsers available, including
 Firefox and Chromium.  Users have some degree of choice in their
 browser selection, including being able to build and audit their
 browser from source.

 The majority of the content stored within Gerrit is also available
 through other means, such as gitweb or the `git://` protocol.
 Any existing search engine spider can crawl the server-side HTML
 produced by gitweb, and thus can index the majority of the changes
 which might appear in Gerrit.  Some engines may even choose to
 crawl the native version control database, as ohloh.net does.
 Therefore the lack of support for most search engine spiders is a
 non-issue for most Gerrit deployments.


 Product Integration
 -------------------

 Gerrit integrates with an existing gitweb installation by optionally
 creating hyperlinks to reference changes on the gitweb server.

 Gerrit integrates with an existing git-daemon installation by
 optionally displaying `git://` URLs for users to download a
 change through the native Git protocol.

 Gerrit integrates with any OpenID provider for user authentication,
 making it easier for users to join a Gerrit site and manage their
 authentication credentials to it.  To make use of Google Accounts
 as an OpenID provider easier, Gerrit has a shorthand "Sign in with
 a Google Account" link on its sign-in screen.  Gerrit also supports
 a shorthand sign in link for Yahoo!.  Other providers may also be
 supported more directly in the future.

 Site administrators may limit the range of OpenID providers to
 a subset of "reliable providers".  Users may continue to use
 any OpenID provider to publish comments, but granted privileges
 are only available to a user if the only entry point to their
 account is through the defined set of "reliable OpenID providers".
 This permits site administrators to require HTTPS for OpenID,
 and to use only large main-stream providers that are trustworthy,
 or to require users to only use a custom OpenID provider installed
 alongside Gerrit Code Review.
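
 The "reliable providers" rule can be illustrated with a short sketch.
 It is an assumption-laden illustration and not Gerrit's actual
 implementation: the function name, the prefix matching, and the
 provider URLs are all hypothetical.  The point it shows is that
 privileges apply only when every entry point to the account is
 trusted, not merely one of them.

```python
def has_granted_privileges(identities, trusted_prefixes):
    """Return True only if every sign-in identity uses a trusted provider.

    `identities` is the list of OpenID identity URLs linked to one account;
    `trusted_prefixes` is the administrator's list of reliable providers.
    Both names are hypothetical and exist only for this sketch.
    """
    if not identities:
        # An account with no identities has no trusted entry point.
        return False
    return all(
        any(identity.startswith(prefix) for prefix in trusted_prefixes)
        for identity in identities
    )


# Hypothetical provider prefixes chosen by a site administrator.
TRUSTED = ("https://openid.example.com/", "https://sso.example.org/")

only_trusted = ["https://openid.example.com/id/alice"]
mixed = ["https://openid.example.com/id/bob", "http://untrusted.example.net/bob"]

print(has_granted_privileges(only_trusted, TRUSTED))
print(has_granted_privileges(mixed, TRUSTED))
```

 In this sketch the second account can still sign in, but it would
 not receive granted privileges because one of its entry points falls
 outside the trusted set.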

 Gerrit integrates with some types of corporate single-sign-on (SSO)
 solutions, typically by having the SSO authentication be performed
 in a reverse proxy web server and then blindly trusting that all
 incoming connections have been authenticated by that reverse proxy.
 When configured to use this form of authentication, Gerrit does
 not integrate with OpenID providers.

 When installing Gerrit, administrators may optionally include an
 HTML header or footer snippet which may include user tracking code,
 such as that used by Google Analytics.  This is a per-instance
 configuration that must be done by hand, and is not supported
 out of the box.  Other site trackers instead of Google Analytics
 can be used, as the administrator can supply any HTML/JavaScript
 they choose.

 Gerrit does not integrate with any Google service, or any other
 services other than those listed above.


 Standards / Developer APIs
 --------------------------

 Gerrit uses an XSRF protected variant of JSON-RPC 1.1 to communicate
 between the browser client and the server.

 As the protocol is not the GWT-RPC protocol, but is instead a
 self-describing standard JSON format, it is easily implemented by
 any 3rd party client application, provided the client has a JSON
 parser and HTTP client library available.

 As the entire command set necessary for the standard web browser
 based UI is exposed through JSON-RPC over HTTP, there are no other
 data feeds or command interfaces to the server.

 Commands requiring user authentication may require the user agent to
 complete a sign-in cycle through the user's OpenID provider in order
 to establish the HTTP cookie Gerrit uses to track user identity.
 Automating this sign-in process for non-web browser agents is
 outside of the scope of Gerrit, as each OpenID provider uses its own
 sign-in sequence.  Use of OpenID providers which have difficult to
 automate interfaces may make it impossible for non-browser agents
 to be used with the JSON-RPC interface.
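
 To show how little a third party client needs, the following sketch
 builds a request body in the shape described by the JSON-RPC 1.1
 working draft.  The method name and token value are placeholders, and
 the `xsrfKey` field name is an assumption about the XSRF variant, so
 consult the XSRF JSON-RPC README linked below before relying on it.
 The sketch only constructs the payload and does not contact a server.

```python
import itertools
import json

# Monotonic request ids, so responses could be matched to requests.
_request_ids = itertools.count(1)


def build_request(method, params, xsrf_key=None):
    """Serialize a JSON-RPC 1.1 style request body (hypothetical helper)."""
    request = {
        "version": "1.1",
        "method": method,
        "params": list(params),
        "id": next(_request_ids),
    }
    if xsrf_key is not None:
        # Assumption: the XSRF token travels as an extra member of the
        # request object.  The real field name may differ.
        request["xsrfKey"] = xsrf_key
    return json.dumps(request)


# Placeholder method name and token, for illustration only.
body = build_request("exampleService.exampleMethod", [42], xsrf_key="token-from-server")
print(body)
```

 A real client would send this body in an HTTP POST together with the
 session cookie established during sign-in.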

 * link:http://json-rpc.org/wd/JSON-RPC-1-1-WD-20060807.html[JSON-RPC 1.1]
 * link:http://android.git.kernel.org/?p=tools/gwtjsonrpc.git;a=blob;f=README;hb=HEAD[XSRF JSON-RPC]


 Privacy Considerations
 ----------------------

 Gerrit stores the following information per user account:

 * Full Name
 * Preferred Email Address
 * Mailing Address '(Optional, Encrypted)'
 * Country '(Optional, Encrypted)'
 * Phone Number '(Optional, Encrypted)'
 * Fax Number '(Optional, Encrypted)'

 The full name and preferred email address fields are shown to any
 site visitor viewing a page containing a change uploaded by the
 account owner, or containing a published comment written by the
 account owner.

 Showing the full name and preferred email is approximately the same
 risk as the `From` header of an email posted to a public mailing
 list that maintains archives, and Gerrit treats these fields in
 much the same way that a mailing list archive might handle them.
 Users who don't want to expose this information should either not
 participate in a Gerrit based online community, or open a new email
 address dedicated for this use.

 As the Gerrit UI data is only available through XSRF protected
 JSON-RPC calls, "screen-scraping" for email addresses is difficult,
 but not impossible.  It is unlikely a spammer will go through the
 effort required to code a custom scraping application necessary
 to cull email addresses from published Gerrit comments.  In most
 cases these same addresses would be more easily obtained from the
 project's mailing list archives.

 The user's name and email address are stored unencrypted in the
 Gerrit metadata store, typically a PostgreSQL database.

 The snail-mail mailing address, country, and phone and fax numbers
 are gathered to help project leads contact the user should there
 be a legal question regarding any change they have uploaded.

 These sensitive fields are immediately encrypted upon receipt with
 a GnuPG public key, and stored "off site" in another data store,
 isolated from the main Gerrit change data.  Gerrit does not have
 access to the matching private key, and as such cannot decrypt the
 information.  Therefore these fields are write-once in Gerrit, as not
 even the account owner can recover the values they previously stored.

 It is expected that the address information would only need to be
 decrypted and revealed with a valid court subpoena, but this is
 really left to the discretion of the Gerrit site administrator as
 to when it is reasonable to reveal this information to a 3rd party.


 Spam and Abuse Considerations
 -----------------------------

 Gerrit makes no attempt to detect spam changes or comments.  The
 somewhat high barrier to entry makes it unlikely that a spammer
 will target Gerrit.

 To upload a change, the client must speak the native Git protocol
 embedded in SSH, with some custom Gerrit semantics added on top.
 The client must have their public key already stored in the Gerrit
 database, which can only be done through the XSRF protected
 JSON-RPC interface.  The level of effort required to construct
 the necessary tools to upload a well-formatted change that isn't
 rejected outright by the Git and Gerrit checksum validations is
 too high for a spammer to get any meaningful return.

 To post and publish a comment a client must sign in with an OpenID
 provider and then use the XSRF protected JSON-RPC interface to
 publish the draft on an existing change record.  Again, the level of
 effort required to implement the Gerrit specific XSRF protections
 and the JSON-RPC payload format necessary to post a draft and then
 publish that draft is simply too high for a spammer to bother with.

 Both of these assumptions are also based upon the idea that Gerrit
 will be a lot less popular than blog software, and thus will be
 running on far fewer websites.  Spammers therefore gain very little
 benefit from getting over the protocol hurdles.

 These assumptions may need to be revisited in the future if any
 public Gerrit site actually notices spam.


 Latency
 -------

 Gerrit targets sub-250 ms latency per page request, mostly by using
 very compact JSON payloads between client and server.  However, as
 most of the serving stack (network, hardware, PostgreSQL metadata
 database) is out of control of the Gerrit developers, no real
 guarantees can be made about latency.


 Scalability
 -----------

 Gerrit is designed for an open source project.  Roughly this
 amounts to parameters such as the following:

 .Design Parameters
 [grid="all"]
 `-----------------'----------------
 Parameter         Estimated Maximum
 -----------------------------------
 Projects            500
 Contributors      2,000
 Changes/Day         400
 Revisions/Change    2.0
 Files/Change        4
 Comments/File       2
 Reviewers/Change    1.0
 -----------------------------------

 CPU Usage
 ~~~~~~~~~

 Very few, if any, open source projects have more than a handful of
 Git repositories associated with them.  Since Gerrit treats one
 Git repository as a project, an assumed limit of 500 projects
 is reasonable.  Only an operating system distribution project
 would really need to be tracking more than a handful of discrete
 Git repositories.

 Almost no open source project has 2,000 contributors over all time,
 let alone on a daily basis.  This figure of 2,000 was WAG'd by
 looking at PR statements published by cell phone companies picking
 up the Android operating system.  If all of the stated employees in
 those PR statements were working on *only* the open source Android
 repositories, we might reach the 2,000 estimate listed here.  As
 these companies have been very closed-source minded in the past, it
 is very unlikely all of their Android engineers will be working on
 the open source repository, and thus 2,000 is a very high estimate.

 The estimate of 400 changes per day was WAG'd off some estimates
 originally obtained from Android's development history.  Writing a
 good change that will be accepted through a peer-review process
 takes time.  The average engineer may need 4-6 hours per change just
 to write the code and unit tests.  Proper design consideration and
 additional but equally important tasks such as meetings, interviews,
 training, and eating lunch will often pad the engineer's day out
 such that suitable changes are only posted once a day, or once
 every other day.  For reference, the entire Linux kernel has an
 average of only 79 changes/day.

 The estimate of 2 revisions/change means that on average any
 given change will need to be modified once to address peer review
 comments before the final revision can be accepted by the project.
 Executing these revisions also eats into the contributor's time,
 and is another factor limiting the number of changes/day accepted
 by the Gerrit instance.

 The estimate of 1 reviewer/change means that on average only one
 person will comment on a change.  Usually this would be the project
 lead, or someone who is familiar with the code being modified.
 The time required to comment further reduces the time available
 for writing one's own changes.

 Gerrit's web UI would require on average `4+F+F*C` HTTP requests to
 review a change and post comments.  Here `F` is the number of files
 modified by the change, and `C` is the number of inline comments left
 by the reviewer per file.  The constant 4 accounts for the request
 to load the reviewer's dashboard, to load the change detail page,
 to publish the review comments, and to reload the change detail
 page after comments are published.

 This WAG'd estimate boils down to <12,800 HTTP requests per day
 (QPD). Assuming these are evenly distributed over an 8 hour work day
 in a single time zone, we are looking at approximately 0.44 queries
 per second (QPS).

 ----
   QPD = Changes_Day * Revisions_Change * Reviewers_Change * (4 + F + F * C)
       = 400         * 2.0              * 1.0              * (4 + 4 + 4 * 2)
       = 12,800
   QPS = QPD / 8_Hours / 60_Minutes / 60_Seconds
       = 0.44
 ----

 Gerrit serves most requests in under 60 ms when using the loopback
 interface and a single processor.  On a single CPU system there is
 sufficient capacity for 16 QPS.  A dual processor system should be
 sufficient for a site with the estimated load described above.

 A more realistic estimate of 79 changes per day (from the Linux
 kernel) suggests only 2,528 queries per day, and a much lower
 0.09 QPS when spread out over an 8 hour work day.
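
 The arithmetic in this section can be reproduced with a short script.
 This is only a sketch: the function names are illustrative and not
 part of Gerrit, the inputs come from the Design Parameters table,
 and the single-CPU capacity is derived from the 60 ms per request
 figure quoted above.

```python
def requests_per_review(files, comments_per_file):
    """HTTP requests for one review: 4 fixed page loads, plus F, plus F*C."""
    return 4 + files + files * comments_per_file


def queries_per_day(changes_per_day, revisions, reviewers, files, comments_per_file):
    """Total HTTP requests per day (QPD) for the given workload."""
    per_review = requests_per_review(files, comments_per_file)
    return changes_per_day * revisions * reviewers * per_review


def queries_per_second(qpd, work_hours=8):
    """Spread a daily request count evenly over one work day."""
    return qpd / (work_hours * 60 * 60)


# Estimated maximum from the Design Parameters table.
design_qpd = queries_per_day(400, 2.0, 1.0, 4, 2)
# Linux kernel rate of 79 changes/day with the same per-change parameters.
kernel_qpd = queries_per_day(79, 2.0, 1.0, 4, 2)
# Requests per second one CPU can serve at 60 ms per request.
single_cpu_capacity = 1000 / 60

print(f"design: {design_qpd:,.0f} QPD, {queries_per_second(design_qpd):.2f} QPS")
print(f"kernel: {kernel_qpd:,.0f} QPD, {queries_per_second(kernel_qpd):.2f} QPS")
print(f"single CPU capacity: {single_cpu_capacity:.1f} QPS")
```

 Changing `work_hours` or the per-file comment count shows how
 sensitive the estimate is to each assumption.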

 Disk Usage
 ~~~~~~~~~~

 The average size of a revision in the Linux kernel once compressed
 by Git is 2,327 bytes, or roughly 2 KB.  Over the course of a year
 a Gerrit server running with the parameters above might see an
 introduction of 570 MB over the total set of 500 projects hosted in
 that server.  This figure assumes the majority of the content is human
 written source code, and not large binary blobs such as disk images.


 Redundancy & Reliability
 ------------------------

 Gerrit largely assumes that the local filesystem where Git repository
 data is stored is always available.  Important data written to disk
 is also forced to the platter with an `fsync()` once it has been
 fully written.  If the local filesystem fails to respond to reads
 or becomes corrupt, Gerrit has no provisions to fall back or retry
 and errors will be returned to clients.

 Gerrit largely assumes that the metadata PostgreSQL database is
 online and answering both read and write queries.  Query failures
 immediately result in the operation aborting and errors being
 returned to the client, with no retry or fallback provisions.

 Due to the relatively small scale described above, it is very likely
 that the Git filesystem and PostgreSQL based metadata database
 are all housed on the same server that is running Gerrit.  If any
 failure arises in one of these components, it is likely to manifest
 in the others too.  It is also likely that the administrator cannot
 be bothered to deploy a cluster of load-balanced server hardware,
 as the scale and expected load do not justify the hardware or
 management costs.

 Most deployments caring about reliability will set up a warm-spare
 standby system and use a manual fail-over process to switch from the
 failed system to the warm-spare.

 As Git is a distributed version control system, and open source
 projects tend to have contributors from all over the world, most
 contributors will be able to tolerate a Gerrit down time of several
 hours while the administrator is notified, signs on, and brings the
 warm-spare up.  Pending changes are likely to need at least 24 hours
 of time on the Gerrit site anyway in order to ensure any interested
 parties around the world have had a chance to comment.  This expected
 lag largely allows for some downtime in a disaster scenario.

 Backups
 ~~~~~~~

 PostgreSQL can be configured to save its write-ahead-log (WAL)
 and ship these logs to other systems, where they are applied to
 a warm-standby backup in real time.  Gerrit instances which care
 about redundancy will set up this feature of PostgreSQL to ensure
 the warm-standby is reasonably current should the master go offline.

 Gerrit can be configured to replicate changes made to the local
 Git repositories over any standard Git transports.  This can be
 configured in `'$site_path'/etc/replication.conf` to send copies
 of all changes over SSH to other servers, or to the Amazon S3 blob
 storage service.


 Logging Plan
 ------------

 Gerrit does not maintain logs on its own.

 Published comments contain a publication date, so users can judge
 when the comment was posted and decide if it was "recent" or not.
 Only the timestamp is stored in the database; the IP address of
 the comment author is not stored.

 Changes uploaded over the SSH daemon from `git push` have the
 standard Git reflog updated with the date and time that the upload
 occurred, and the Gerrit account identity of who did the upload.
 Changes submitted and merged into a branch also update the
 Git reflog.  These logs are available only to the Gerrit site
 administrator, and they are not replicated through the automatic
 replication noted earlier.  These logs are primarily recorded for an
 "oh s**t" moment where the administrator has to rewind data.  In most
 installations they are a waste of disk space.  Future versions of
 JGit may allow disabling these logs, and Gerrit may take advantage
 of that feature to stop writing these logs.

 A web server positioned in front of Gerrit (such as a reverse proxy)
 or the hosting servlet container may record access logs, and these
 logs may be mined for usage information.  This is outside of the
 scope of Gerrit.


 Testing Plan
 ------------

 Gerrit is currently manually tested through its web UI.

 JGit has a fairly extensive automated unit test suite.  Most new
 changes to JGit are rejected unless corresponding automated unit
 tests are included.


 Caveats
 -------

 Rietveld can't be used as it does not provide the "submit over the
 web" feature that Gerrit provides for Git.

 Gitosis can't be used as it does not provide any code review
 features, but it does provide basic access controls.

 Email based code review does not scale to a project as large and
 complex as Android.  Most contributors at least need some sort of
 dashboard to keep track of any pending reviews, and some way to
 correlate updated revisions back to the comments written on prior
 revisions of the same logical change.

 GERRIT
 ------
 Part of link:index.html[Gerrit Code Review]