| :linkattrs: |
| = Gerrit Code Review - System Design |
| |
| == Objective |
| |
| Gerrit is a web-based code review system, facilitating online code |
| reviews for projects using the Git version control system. |
| |
| Gerrit makes reviews easier by showing changes in a side-by-side |
| display, and allowing inline/file comments to be added by any reviewer. |
| |
| Gerrit simplifies Git-based project maintainership by permitting |
| any authorized user to submit changes to the master Git repository, |
| rather than requiring all approved changes to be merged in by |
| hand by the project maintainer. This functionality enables a more |
| centralized usage of Git. |
| |
| |
| == Background |
| |
| Git is a distributed version control system, wherein each repository |
| is assumed to be owned/maintained by a single user. There are no |
| inherent security controls built into Git, so the ability to read |
| from or write to a repository is controlled entirely by the host's |
| filesystem or network access controls. |
| |
| The objective of Gerrit is to facilitate Git development by larger |
| teams: it provides a means to enforce organizational policies around |
| code submissions, eg. "all code must be reviewed by another |
| developer", "all code shall pass tests". It achieves this by |
| |
| * providing fine-grained (per-branch, per-repository, inheriting) |
| access controls, which allow a Gerrit admin to delegate permissions |
| to different team(-lead)s. |
| |
| * facilitating code review: Gerrit offers a web view of pending code |
| changes that allows for easy reading and commenting by humans. The |
| web view can offer data coming out of automated QA processes (eg. |
| CI). The permission system also includes fine-grained control of who |
| can approve pending changes for submission, to further facilitate |
| delegation of code ownership. |
| |
| == Overview |
| |
| Developers create one or more changes on their local desktop system, |
| then upload them for review to Gerrit using the standard `git push` |
| command line program, or any GUI which can invoke `git push` on behalf |
| of the user. Authentication and data transfer are handled through SSH |
| and HTTPS. Uploads are protected by the authentication, |
| confidentiality and integrity offered by the transport (SSH, HTTPS). |
| |
| Each Git commit created on the client desktop system is converted into |
| a unique change record which can be reviewed independently. |
| |
| A summary of each newly uploaded change is automatically emailed |
| to reviewers, so they receive a direct hyperlink to review the |
| change on the web. Reviewer email addresses can be specified on the |
| `git push` command line, but typically reviewers are added in the web |
| interface. |
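
As a rough illustration of the upload step, the sketch below builds the `git push` argument list for sending the current commit for review. It assumes Gerrit's `refs/for/<branch>` upload namespace and the `%r=<email>` push option for naming reviewers; the helper name `review_push_args`, the remote name `origin`, and the email address are hypothetical.

```python
def review_push_args(branch, reviewers=(), remote="origin"):
    """Build the argv for uploading HEAD for review on `branch`.

    Assumes the `refs/for/<branch>` upload namespace and `%r=<email>`
    reviewer options; the helper itself is hypothetical.
    """
    target = f"refs/for/{branch}"
    if reviewers:
        # Reviewer options are appended after '%' and separated by commas.
        target += "%" + ",".join(f"r={email}" for email in reviewers)
    return ["git", "push", remote, f"HEAD:{target}"]


if __name__ == "__main__":
    print(" ".join(review_push_args("master", ["alice@example.com"])))
```

Any GUI or script that can run this command line uploads changes the same way, which is why no dedicated client is needed.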
| |
| Reviewers use the web interface to read the side-by-side or unified |
| diff of a change, and insert draft inline/file comments where |
| appropriate. A draft comment is visible only to the reviewer, until |
| they publish those comments. Published comments are automatically |
| emailed to the change author by Gerrit, and are CC'd to all other |
| reviewers who have already commented on the change. |
| |
| Reviewers can score the change ("vote"), indicating whether they feel the |
| change is ready for inclusion in the project, needs more work, or |
| should be rejected outright. These scores provide direct feedback to |
| Gerrit's change submit function. |
| |
| After a change has been scored positively by reviewers, Gerrit enables |
| a submit button on the web interface. Authorized users can push the |
| submit button to have the change enter the project repository. The |
| user pressing the submit button does not need to be the author of the |
| change. |
| |
| |
| == Infrastructure |
| |
| End-user web browsers make HTTP requests directly to Gerrit's |
| HTTP server. As nearly all of the user interface is implemented |
| through PolyGerrit, the majority of these requests are transmitting |
| compressed JSON payloads, with all HTML being generated within the |
| browser. |
| |
| Gerrit's HTTP server side component is implemented as a standard Java |
| servlet, and thus runs within any link:install-j2ee.html[J2EE servlet |
| container]. The standard install will run inside Jetty, which is |
| included in the binary. |
| |
| End-user uploads are performed over SSH or HTTP, so Gerrit's servlets |
| also start up a background thread to receive SSH connections through |
| an independent SSH port. SSH clients communicate directly with this |
| port, bypassing the HTTP server used by browsers. |
| |
| User authentication is handled by identity realms. Gerrit supports the |
| following types of authentication: |
| |
| * OpenID (see link:http://openid.net/developers/specs/[OpenID Specifications,role=external,window=_blank]) |
| * OAuth2 |
| * LDAP |
| * Google accounts (on googlesource.com) |
| * SAML |
| * Kerberos |
| * 3rd party SSO |
| |
| === NoteDb |
| |
| Server side data storage for Gerrit is broken down into two different |
| categories: |
| |
| * Git repository data |
| * Gerrit metadata |
| |
| The Git repository data is the Git object database used to store |
| already submitted revisions, as well as all uploaded (proposed) |
| changes. Gerrit uses the standard Git repository format, and |
| therefore requires direct filesystem access to the repositories. |
| All repository data is stored in the filesystem and accessed through |
| the JGit library. Repository data can be stored on remote servers |
| accessible through NFS or SMB, but the remote directory must |
| be mounted on the Gerrit server as part of the local filesystem |
| namespace. Remote filesystems are likely to perform worse than |
| local ones, due to Git disk IO behavior not being optimized for |
| remote access. |
| |
| The Gerrit metadata contains a summary of the available changes, all |
| comments (published and drafts), and individual user account |
| information. |
| |
| Gerrit metadata is also stored in Git, with the commits marking the |
| historical state of metadata. Data is stored in the trees associated |
| with the commits, typically using Git config file or JSON as the base |
| format. For metadata, there are 3 types of data: changes, accounts and |
| groups. |
| |
| Accounts are stored in a special Git repository `All-Users`. |
| |
| Accounts can be grouped in groups. Gerrit has a built-in group system, |
| but can also interface to external group systems (eg. Google groups, |
| LDAP). The built-in groups are stored in `All-Users`. |
| |
| Draft comments are stored in `All-Users` too. |
| |
| Permissions are stored in Git, in a branch `refs/meta/config` for the |
| repository. Repository configuration (including permissions) supports |
| single inheritance, with the `All-Projects` repository containing |
| site-wide defaults. |
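
The single-inheritance lookup can be pictured with a small sketch. It is a simplified model, assuming each repository has at most one parent and that the first value found while walking towards `All-Projects` wins; the function name, the setting key, and the example repositories are hypothetical, and real permission evaluation is more involved than this override-only lookup.

```python
def resolve_setting(project, key, parents, settings):
    """Walk the single-parent chain until `key` is found.

    parents:  repository name -> parent repository name
    settings: repository name -> dict of locally configured values
    """
    seen = set()  # guards against accidental cycles in `parents`
    while project is not None and project not in seen:
        seen.add(project)
        value = settings.get(project, {}).get(key)
        if value is not None:
            return value
        project = parents.get(project)
    return None


if __name__ == "__main__":
    parents = {"tools/build": "tools", "tools": "All-Projects"}
    settings = {
        "All-Projects": {"example.limit": "10m"},
        "tools": {"example.limit": "50m"},
    }
    print(resolve_setting("tools/build", "example.limit", parents, settings))
```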
| |
| Code review metadata is stored in Git, alongside the code under |
| review. Metadata includes change status, votes, comments. This review |
| metadata is stored in NoteDb along with the submitted code and code |
| under review. Hence, the review history can be exported with `git |
| clone --mirror` by anyone with sufficient permissions. |
| |
| == Permissions |
| |
| Permissions are specified on branch names, and given to groups. For |
| example, |
| |
| ``` |
| [access "refs/heads/stable/*"] |
| push = group Release-Engineers |
| ``` |
| |
| this provides a rule, granting Release-Engineers push permission for |
| stable branches. |
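
To make the rule concrete, here is a minimal sketch of how such a rule could be evaluated. It handles only exact ref names and patterns ending in `/*`, and it treats a rule as a plain "allow"; the function names and the in-memory `rules` structure are assumptions for illustration, not Gerrit's implementation.

```python
def pattern_matches(pattern, ref):
    """Match an exact ref name, or a prefix pattern ending in '/*'."""
    if pattern.endswith("/*"):
        return ref.startswith(pattern[:-1])  # keep the trailing '/'
    return ref == pattern


def may_push(rules, user_groups, ref):
    """rules: ref pattern -> set of group names granted push."""
    return any(
        pattern_matches(pattern, ref) and (user_groups & groups)
        for pattern, groups in rules.items()
    )


if __name__ == "__main__":
    rules = {"refs/heads/stable/*": {"Release-Engineers"}}
    print(may_push(rules, {"Release-Engineers"}, "refs/heads/stable/3.1"))
    print(may_push(rules, {"Developers"}, "refs/heads/stable/3.1"))
```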
| |
| There are fundamentally two types of permissions: |
| |
| * Write permissions (who can vote, push, submit etc.) |
| |
| * Read permissions (who can see data) |
| |
| Read permissions need special treatment across Gerrit, because Gerrit |
| should only surface data (including repository existence) if a user |
| has read permission. This means that |
| |
| * The git wire protocol support must omit references from |
| advertisement if the user lacks read permissions |
| |
| * Uploads through the git wire protocol must refuse commits that are |
| based on SHA-1s for data that the user can't see. |
| |
| * Tags are only visible if their commits are visible to the user through a |
| non-tag reference. |
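
The advertisement and tag rules above can be sketched as a filter over the ref list. This is a toy model: commits are plain strings, `reachable` is a precomputed map from a ref to the commits reachable from it, and all names are hypothetical.

```python
def visible_refs(refs, readable, reachable):
    """Return the refs that may be advertised to a user.

    refs:      ref name -> commit id
    readable:  set of non-tag ref names the user may read
    reachable: ref name -> set of commit ids reachable from that ref
    """
    visible = {
        name: commit
        for name, commit in refs.items()
        if not name.startswith("refs/tags/") and name in readable
    }
    # Commits the user can already see through a non-tag ref.
    seen_commits = set()
    for name in visible:
        seen_commits |= reachable.get(name, {refs[name]})
    # A tag is advertised only if its commit is among those commits.
    for name, commit in refs.items():
        if name.startswith("refs/tags/") and commit in seen_commits:
            visible[name] = commit
    return visible
```

With a readable `refs/heads/master` and an unreadable `refs/heads/secret`, a tag pointing into the history of `master` is advertised, while a tag that only points into `secret` is omitted.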
| |
| Metadata (eg. OAuth credentials) is also stored in Git. Existing |
| endpoints must refuse creating branches or changes that expose these |
| metadata or allow changes to them. |
| |
| |
| === Indexing |
| |
| Almost all data is stored in Git, but Git only supports fast lookup by |
| SHA-1 or by ref (branch) name. Therefore Gerrit also has an indexing |
| system (powered by Lucene by default) for other types of queries. |
| There are 4 indices: |
| |
| * Project index - find repositories by name, parent project, etc. |
| * Account index - find accounts by name, email, etc. |
| * Group index - find groups by name, owner, description etc. |
| * Change index - find changes by file, status, modification date etc. |
| |
| The base entities are characterized by SHA-1s. Storing the |
| characterizing SHA-1s allows detection of stale index entries. |
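
A minimal sketch of the staleness check, assuming each index entry records the SHA-1 that characterized the entity when it was indexed. Here the SHA-1 is computed over raw bytes purely for illustration; the entry layout and function names are hypothetical.

```python
import hashlib


def sha1_of(content: bytes) -> str:
    """Hex SHA-1 of the given bytes (stand-in for a Git object id)."""
    return hashlib.sha1(content).hexdigest()


def make_entry(entity_id, content: bytes):
    """Index entry that remembers the SHA-1 it was built from."""
    return {"id": entity_id, "sha1": sha1_of(content)}


def is_stale(entry, current_content: bytes) -> bool:
    """An entry is stale when the entity no longer has the stored SHA-1."""
    return entry["sha1"] != sha1_of(current_content)


if __name__ == "__main__":
    entry = make_entry("change-1", b"status: NEW\n")
    print(is_stale(entry, b"status: NEW\n"))
    print(is_stale(entry, b"status: MERGED\n"))
```

Because the comparison needs only the stored hash and the current one, stale entries can be found without re-reading the indexed fields.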
| |
| == Plug-in architecture |
| |
| Gerrit has a plug-in architecture. Plugins can be installed by |
| dropping them into $site_directory/plugins, or at runtime through |
| plugin SSH commands, or the plugin REST API. |
| |
| === Backend plugins |
| |
| At runtime, code can be loaded from a `.jar` file. This code can hook |
| into predefined extension points. A common use of plugins is to have |
| Gerrit interoperate with site-specific tools, such as CI-systems or |
| issue trackers. |
| |
| // list some notable extension points, and notable plugins |
| // link to plugin development |
| |
| Some backend plugins expose the JVM for scripting use (eg. Groovy, |
| Scala), so plugins can be written without having to set up a Java |
| development environment. |
| |
| // Luca to expand: how do script plugins load their scripts? |
| |
| === Frontend plugins |
| |
| The UI can be extended using Frontend plugins. This is useful for |
| changing the look & feel of Gerrit, but it can also be used to surface |
| data from systems that aren't integrated with the Gerrit backend, eg. |
| CI systems or code coverage providers. |
| |
| // FE team to write a bit more: |
| // * how to load ? |
| // * XSRF, CORS ? |
| |
| == Internationalization and Localization |
| |
| As a source code review system for open source projects, where the |
| commonly preferred language for communication is typically English, |
| Gerrit does not make internationalization or localization a priority. |
| |
| The majority of Gerrit's users will be writing change descriptions |
| and comments in English, and therefore an English user interface |
| is usable by the target user base. |
| |
| |
| == Accessibility Considerations |
| |
| // UI team to rewrite this. |
| |
| Whenever possible Gerrit displays raw text rather than image icons, |
| so screen readers should still be able to provide useful information |
| to blind persons accessing Gerrit sites. |
| |
| Standard HTML hyperlinks are used rather than HTML div or span tags |
| with click listeners. This provides two benefits to the end-user. |
| The first benefit is that screen readers are optimized for locating |
| standard hyperlink anchors and presenting them to the end-user as |
| a navigation action. The second benefit is that users can use |
| the 'open in new tab/window' feature of their browser whenever |
| they choose. |
| |
| When possible, Gerrit uses the ARIA properties on DOM widgets to |
| provide hints to screen readers. |
| |
| |
| == Browser Compatibility |
| |
| Gerrit requires a JavaScript enabled browser. |
| |
| // UI team to add section on minimum browser requirements. |
| |
| As Gerrit is a pure JavaScript application on the client side, with |
| no server side rendering fallbacks, the browser must support modern |
| JavaScript semantics in order to access the Gerrit web application. |
| Dumb clients such as `lynx`, `wget`, `curl`, or even many search engine |
| spiders are not able to access Gerrit content. |
| |
| All of the content stored within Gerrit is also available through |
| other means, such as gitweb or the `git://` protocol. Any existing |
| search engine crawlers can index the server-side HTML served by a code |
| browser, and thus can index the majority of the changes which might |
| appear in Gerrit. Therefore the lack of support for most search engine |
| crawlers is a non-issue for most Gerrit deployments. |
| |
| |
| == Product Integration |
| |
| Gerrit optionally surfaces links to HTML pages in a code browser. The |
| links are configurable, and Gerrit comes with a built-in code browser, |
| called Gitiles. |
| |
| Gerrit integrates with some types of corporate single-sign-on (SSO) |
| solutions, typically by having the SSO authentication be performed |
| in a reverse proxy web server and then blindly trusting that all |
| incoming connections have been authenticated by that reverse proxy. |
| When configured to use this form of authentication, Gerrit does |
| not integrate with OpenID providers. |
| |
| When installing Gerrit, administrators may optionally include an |
| HTML header or footer snippet which may include user tracking code, |
| such as that used by Google Analytics. This is a per-instance |
| configuration that must be done by hand, and is not supported |
| out of the box. Other site trackers instead of Google Analytics |
| can be used, as the administrator can supply any HTML/JavaScript |
| they choose. |
| |
| Gerrit does not integrate with any Google service, or any other |
| services other than those listed above. |
| |
| Plugins (see above) can be used to drive product integrations from the |
| Gerrit side. Products that support Gerrit explicitly can use the REST |
| API or the SSH API to contact Gerrit. |
| |
| |
| == Privacy Considerations |
| |
| Gerrit stores the following information per user account: |
| |
| * Full Name |
| * Preferred Email Address |
| |
| The full name and preferred email address fields are shown to any |
| site visitor viewing a page containing a change uploaded by the |
| account owner, or containing a published comment written by the |
| account owner. |
| |
| Showing the full name and preferred email is approximately the same |
| risk as the `From` header of an email posted to a public mailing |
| list that maintains archives, and Gerrit treats these fields in |
| much the same way that a mailing list archive might handle them. |
| Users who don't want to expose this information should either not |
| participate in a Gerrit based online community, or open a new email |
| address dedicated for this use. |
| |
| As the Gerrit UI data is only available through XSRF protected |
| JSON-RPC calls, "screen-scraping" for email addresses is difficult, |
| but not impossible. It is unlikely a spammer will go through the |
| effort required to code a custom scraping application necessary |
| to cull email addresses from published Gerrit comments. In most |
| cases these same addresses would be more easily obtained from the |
| project's mailing list archives. |
| |
| The user's name and email address are stored unencrypted in the |
| link:config-accounts.html#all-users[All-Users] repository. |
| |
| == Spam and Abuse Considerations |
| |
| There is no spam protection for the Git protocol upload path. |
| Uploading a change successfully requires a pre-existing account, and a |
| lot of up-front effort. |
| |
| Gerrit makes no attempt to detect spam changes or comments in the web |
| UI. To post and publish a comment a client must sign in and then use |
| the XSRF protected JSON-RPC interface to publish the draft on an |
| existing change record. |
| |
| Absence of spam handling is based upon the idea that Gerrit caters to |
| a niche audience, and will therefore be unattractive to spammers. In |
| addition, it is not a factor for corporate, on-premise deployments. |
| |
| |
| == Scalability |
| |
| Gerrit supports the Git wire protocol, and an API (one API for HTTP, |
| and one for SSH). |
| |
| The git wire protocol does a client/server negotiation to avoid |
| sending too much data. This negotiation occupies a CPU, so the number |
| of concurrent push/fetch operations should be capped by the number of |
| CPUs. |
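
One way to picture such a cap is a semaphore sized to the CPU count. The sketch below is an illustration only, not how Gerrit schedules requests; the class name and its methods are hypothetical.

```python
import os
import threading


class NegotiationLimiter:
    """Admit at most `slots` concurrent push/fetch negotiations."""

    def __init__(self, slots=None):
        # Default to one slot per CPU; fall back to 1 if the count is unknown.
        self.slots = slots or os.cpu_count() or 1
        self._sem = threading.BoundedSemaphore(self.slots)

    def try_start(self) -> bool:
        """Return True if a slot was free, False if the caller must wait."""
        return self._sem.acquire(blocking=False)

    def finish(self) -> None:
        """Give the slot back once the negotiation is done."""
        self._sem.release()
```

A caller that gets `False` from `try_start()` would queue or reject the request instead of oversubscribing the CPUs.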
| |
| Clients on slow network connections may be network bound rather than |
| server side CPU bound, in which case a core may be effectively shared |
| with another user. Possible core sharing due to network bottlenecks |
| generally holds true for network connections running below 10 MiB/sec. |
| |
| Deployments for large, distributed companies can replicate Git data to |
| read-only replicas to offload fetch traffic. The read-only replicas |
| should also serve this data using Gerrit to ensure that permissions |
| are obeyed. |
| |
| The API serves requests of varying costs. Requests that originate in |
| the UI can block productivity, so care has been taken to optimize |
| these for latency, using the following techniques: |
| |
| * Async calls: the UI becomes responsive before some UI elements |
| have finished loading |
| |
| * Caching: metadata is stored in Git, which is relatively expensive to |
| access. This is sped up by multiple caches. Metadata entities are |
| stored in Git, and can therefore be seen as immutable values keyed |
| by SHA-1, which is very amenable to caching. All SHA-1 keyed caches |
| can be persisted on local disk. |
| |
| The size (memory, disk) of these caches should be adapted to the |
| instance size (number of users, size and quantity of repositories) |
| for optimal performance. |
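
The caching idea described above can be sketched in a few lines: because a value keyed by SHA-1 never changes, the cache needs no invalidation logic, only a size policy (omitted here). The class name and the `loader` callback are hypothetical.

```python
class Sha1Cache:
    """Cache of immutable values keyed by SHA-1."""

    def __init__(self, loader):
        self._loader = loader  # expensive read from Git, supplied by the caller
        self._values = {}
        self.misses = 0

    def get(self, sha1):
        if sha1 not in self._values:
            self.misses += 1
            self._values[sha1] = self._loader(sha1)
        # A hit can be served without checking freshness: the same SHA-1
        # always denotes the same content.
        return self._values[sha1]


if __name__ == "__main__":
    cache = Sha1Cache(lambda sha1: f"parsed metadata for {sha1}")
    cache.get("a" * 40)
    cache.get("a" * 40)
    print(cache.misses)
```

When an entity changes, it gets a new SHA-1 and therefore a new cache key, so old entries simply stop being requested.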
| |
| Git does not impose fundamental limits (eg. number of files per |
| change) on data. To ensure stability, Gerrit configures a number of |
| default limits for these. |
| |
| // add a link to the default settings. |
| |
| === Scaling team size |
| |
| A team of size N has N^2 possible interactions. As a result, features |
| that expose interactions with activities of other team members have a |
| quadratic cost in aggregate. The following features scale poorly with |
| large team sizes: |
| |
| * the change screen shows conflicting changes by default. This data is |
| cached, but updates to pending changes cause cache misses. For a |
| single change, the amount of work is proportional to the number of |
| pending changes, so in aggregate, the cost of this feature is |
| quadratic in the team size. |
| |
| * the change screen shows if a change is mergeable to the target |
| branch. If the target branch moves quickly (large developer team), |
| this causes cache misses. In aggregate, the cost of this feature is |
| also quadratic. |
| |
| Both features should be turned off for repositories that involve 1000s |
| of developers. |
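
The quadratic growth is easy to see with a back-of-the-envelope count. The sketch assumes that every pending change is checked against every other pending change exactly once per refresh, which is a simplification of the behaviour described above; the function name is hypothetical.

```python
def pairwise_checks(pending_changes: int) -> int:
    """Checks needed if each pending change is compared with all others."""
    return pending_changes * (pending_changes - 1)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, pairwise_checks(n))
```

Growing the number of pending changes by a factor of 10 grows the work by roughly a factor of 100.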
| |
| === Browser performance |
| |
| // say something about browser performance tuning. |
| |
| === Real life numbers |
| |
| |
| Gerrit is designed for very large projects, both open source and |
| proprietary commercial projects. For a single Gerrit process, the |
| following limits are known to work: |
| |
| .Observed maximums |
| [options="header"] |
| |====================================================== |
| |Parameter | Maximum | Deployment |
| |Projects | 50,000 | gerrithub.io |
| |Contributors | 150,000 | eclipse.org |
| |Bytes/repo | 100G | Qualcomm internal |
| |Changes/repo | 300k | Qualcomm internal |
| |Revisions/Change | 300 | Qualcomm internal |
| |Reviewers/Change | 87 | Qualcomm internal |
| |====================================================== |
| |
| |
| // find some numbers for these stats: |
| // |Files/repo | ? | |
| // |Files/Change | ? | |
| // |Comments/Change | ? | |
| // |max QPS/CPU | ? | |
| |
| |
| Google runs a horizontally scaled deployment. We have seen the |
| following per-JVM maximums: |
| |
| .Observed maximums (googlesource.com) |
| [options="header"] |
| |====================================================== |
| |Parameter | Maximum | Deployment |
| |Files/repo | 500,000 | chromium-review |
| |Bytes/repo | 12G | chromium-review |
| |Changes/repo | 500k | chromium-review |
| |Revisions/Change | 1900 | chromium-review |
| |Files/Change | 10,000| android-review |
| |Comments/Change | 1,200 | chromium-review |
| |====================================================== |
| |
| |
| == Redundancy & Reliability |
| |
| Gerrit is structured as a single JVM process, reading and writing to a |
| single file system. If there are hardware failures in the machine |
| running the JVM, or the storage holding the repositories, there is no |
| recourse; on failure, errors will be returned to the client. |
| |
| Deployments needing more stringent uptime guarantees can use a |
| replication/multi-master setup, which ensures availability and |
| geographical distribution, at the cost of slower write actions. |
| |
| // TODO: link. |
| |
| === Backups |
| |
| Using the standard replication plugin, Gerrit can be configured |
| to replicate changes made to the local Git repositories over any |
| standard Git transport. After the plugin is installed, remote |
| destinations can be configured in `'$site_path'/etc/replication.conf` |
| to send copies of all changes over SSH to other servers, or to the |
| Amazon S3 blob storage service. |
| |
| |
| == Logging Plan |
| |
| Gerrit stores Apache style HTTPD logs, as well as ERROR/INFO messages |
| from the Java logger, under `$site_dir/logs/`. |
| |
| Published comments contain a publication date, so users can judge |
| when the comment was posted and decide if it was "recent" or not. |
| Only the timestamp is stored in the database; the IP address of |
| the comment author is not stored. |
| |
| Changes uploaded over the SSH daemon from `git push` have the |
| standard Git reflog updated with the date and time that the upload |
| occurred, and the Gerrit account identity of who did the upload. |
| Changes submitted and merged into a branch also update the |
| Git reflog. These logs are available only to the Gerrit site |
| administrator, and they are not replicated through the automatic |
| replication noted earlier. These logs are primarily recorded for an |
| "oh s**t" moment where the administrator has to rewind data. In most |
| installations they are a waste of disk space. Future versions of |
| JGit may allow disabling these logs, and Gerrit may take advantage |
| of that feature to stop writing these logs. |
| |
| A web server positioned in front of Gerrit (such as a reverse proxy) |
| or the hosting servlet container may record access logs, and these |
| logs may be mined for usage information. This is outside of the |
| scope of Gerrit. |
| |
| |
| GERRIT |
| ------ |
| Part of link:index.html[Gerrit Code Review] |
| |
| SEARCHBOX |
| --------- |