gerrit/Documentation/dev-design.txt
Han-Wen Nienhuys 348a6030c7 Rename "slave" to "replica" in documentation and command-line
Per
https://en.wikipedia.org/wiki/Master/slave_(technology)#Terminology_concerns
the use "slave" is considered offsensive in some circles.

For compatibility reasons, the "container.slave" property is
maintained, and the --slave flag is an alias for --replica.

Change-Id: If57df22705dac9450c1cf9900f051d9123d7dd2c
2019-09-26 23:19:53 +02:00

636 lines
29 KiB
Plaintext

= Gerrit Code Review - System Design
== Objective
Gerrit is a web based code review system, facilitating online code
reviews for projects using the Git version control system.
Gerrit makes reviews easier by showing changes in a side-by-side
display, and allowing inline/file comments to be added by any reviewer.
Gerrit simplifies Git based project maintainership by permitting
any authorized user to submit changes to the master Git repository,
rather than requiring all approved changes to be merged in by
hand by the project maintainer. This functionality enables a more
centralized usage of Git.
== Background
Google developed Mondrian, a Perforce based code review tool to
facilitate peer-review of changes prior to submission to the central
code repository. Mondrian is not open source, as it is tied to the
use of Perforce and to many Google-only services, such as Bigtable.
Google employees have often described how useful Mondrian and its
peer-review process is to their day-to-day work.
Guido van Rossum open sourced portions of Mondrian within Rietveld,
a similar code review tool running on Google App Engine, but for
use with Subversion rather than Perforce. Rietveld is in common
use by many open source projects, facilitating their peer reviews
much as Mondrian does for Google employees. Unlike Mondrian and
the Google Perforce triggers, Rietveld is strictly advisory and
does not enforce peer-review prior to submission.
Git is a distributed version control system, wherein each repository
is assumed to be owned/maintained by a single user. There are no
inherent security controls built into Git, so the ability to read
from or write to a repository is controlled entirely by the host's
filesystem access controls. When multiple maintainers collaborate
on a single shared repository a high degree of trust is required,
as any collaborator with write access can alter the repository.
Gitosis provides tools to secure centralized Git repositories,
permitting multiple maintainers to manage the same project at once,
by restricting the access to only over a secure network protocol,
much like Perforce secures a repository by only permitting access
over its network port.
The Android Open Source Project (AOSP) was founded by Google by the
open source releasing of the Android operating system. AOSP has
selected Git as its primary version control tool. As many of the
engineers have a background of working with Mondrian at Google,
there is a strong desire to have the same (or better) feature set
available for Git and AOSP.
Gerrit Code Review started as a simple set of patches to Rietveld,
and was originally built to service AOSP. This quickly turned
into a fork as we added access control features that Guido van
Rossum did not want to see complicating the Rietveld code base. As
the functionality and code were starting to become drastically
different, a different name was needed. Gerrit calls back to the
original namesake of Rietveld, Gerrit Rietveld, a Dutch architect.
Gerrit 2.x is a complete rewrite of the Gerrit fork, completely
changing the implementation from Python on Google App Engine, to Java
on a J2EE servlet container and an SQL database.
Since Gerrit 3.x link:note-db.html[NoteDb] replaced the SQL database
and all metadata is now stored in Git.
* link:http://video.google.com/videoplay?docid=-8502904076440714866[Mondrian Code Review On The Web]
* link:https://github.com/rietveld-codereview/rietveld[Rietveld - Code Review for Subversion]
* link:http://eagain.net/gitweb/?p=gitosis.git;a=blob;f=README.rst;hb=HEAD[Gitosis README]
* link:http://source.android.com/[Android Open Source Project]
== Overview
Developers create one or more changes on their local desktop system,
then upload them for review to Gerrit using the standard `git push`
command line program, or any GUI which can invoke `git push` on
behalf of the user. Authentication and data transfer are handled
through SSH. Users are authenticated by username and public/private
key pair, and all data transfer is protected by the SSH connection
and Git's own data integrity checks.
Each Git commit created on the client desktop system is converted
into a unique change record which can be reviewed independently.
Change records are stored in NoteDb.
A summary of each newly uploaded change is automatically emailed
to reviewers, so they receive a direct hyperlink to review the
change on the web. Reviewer email addresses can be specified on the
`git push` command line, but typically reviewers are automatically
selected by Gerrit by identifying users who have change approval
permissions in the project.
Reviewers use the web interface to read the side-by-side or unified
diff of a change, and insert draft inline/file comments where
appropriate. A draft comment is visible only to the reviewer, until
they publish those comments. Published comments are automatically
emailed to the change author by Gerrit, and are CC'd to all other
reviewers who have already commented on the change.
When publishing comments reviewers are also given the opportunity
to score the change, indicating whether they feel the change is
ready for inclusion in the project, needs more work, or should be
rejected outright. These scores provide direct feedback to Gerrit's
change submit function.
After a change has been scored positively by reviewers, Gerrit
enables a submit button on the web interface. Authorized users
can push the submit button to have the change enter the project
repository. The equivalent in Subversion or Perforce would be
that Gerrit is invoking `svn commit` or `p4 submit` on behalf of
the web user pressing the button. Due to the way Git audit trails
are maintained, the user pressing the submit button does not need
to be the author of the change.
== Infrastructure
End-user web browsers make HTTP requests directly to Gerrit's
HTTP server. As nearly all of the user interface is implemented
through PolyGerrit, the majority of these requests are transmitting
compressed JSON payloads, with all HTML being generated within the
browser. Most responses are under 1 KB.
Gerrit's HTTP server side component is implemented as a standard
Java servlet, and thus runs within any J2EE servlet container.
Popular choices for deployments would be Tomcat or Jetty, as these
are high-quality open-source servlet containers that are readily
available for download.
End-user uploads are performed over SSH, so Gerrit's servlets also
start up a background thread to receive SSH connections through
an independent SSH port. SSH clients communicate directly with
this port, bypassing the HTTP server used by browsers.
Server side data storage for Gerrit is broken down into two different
categories:
* Git repository data
* Gerrit metadata
The Git repository data is the Git object database used to store
already submitted revisions, as well as all uploaded (proposed)
changes. Gerrit uses the standard Git repository format, and
therefore requires direct filesystem access to the repositories.
All repository data is stored in the filesystem and accessed through
the JGit library. Repository data can be stored on remote servers
accessible through NFS or SMB, but the remote directory must
be mounted on the Gerrit server as part of the local filesystem
namespace. Remote filesystems are likely to perform worse than
local ones, due to Git disk IO behavior not being optimized for
remote access.
The Gerrit metadata contains a summary of the available changes,
all comments (published and drafts), and individual user account
information. The metadata is mostly housed in the database (*1),
which can be located either on the same server as Gerrit, or on
a different (but nearby) server. Most installations would opt to
install both Gerrit and the metadata database on the same server,
to reduce administration overheads.
User authentication is handled by OpenID, and therefore Gerrit
requires that the OpenID provider selected by a user must be
online and operating in order to authenticate that user.
* link:http://www.kernel.org/pub/software/scm/git/docs/gitrepository-layout.html[Git Repository Format]
* link:http://openid.net/developers/specs/[OpenID Specifications]
*1 Although an effort is underway to eliminate the use of the
database altogether, and to store all the metadata directly in
the git repositories themselves. So far, as of Gerrit 2.2.1, of
all Gerrit's metadata, only the project configuration metadata
has been migrated out of the database and into the git
repositories for each project.
== Internationalization and Localization
As a source code review system for open source projects, where the
commonly preferred language for communication is typically English,
Gerrit does not make internationalization or localization a priority.
The majority of Gerrit's users will be writing change descriptions
and comments in English, and therefore an English user interface
is usable by the target user base.
Right-to-left (RTL) support is only barely considered within the
Gerrit code base. Some portions of the code have tried to take
RTL into consideration, while others probably need to be modified
before translating the UI to an RTL language.
== Accessibility Considerations
Whenever possible Gerrit displays raw text rather than image icons,
so screen readers should still be able to provide useful information
to blind persons accessing Gerrit sites.
Standard HTML hyperlinks are used rather than HTML div or span tags
with click listeners. This provides two benefits to the end-user.
The first benefit is that screen readers are optimized to locating
standard hyperlink anchors and presenting them to the end-user as
a navigation action. The second benefit is that users can use
the 'open in new tab/window' feature of their browser whenever
they choose.
When possible, Gerrit uses the ARIA properties on DOM widgets to
provide hints to screen readers.
== Browser Compatibility
Supporting non-JavaScript enabled browsers is a non-goal for Gerrit.
As Gerrit is a pure JavaScript application on the client side, with
no server side rendering fallbacks, the browser must support modern
JavaScript semantics in order to access the Gerrit web application.
Dumb clients such as `lynx`, `wget`, `curl`, or even many search engine
spiders are not able to access Gerrit content.
There are number of web browsers available with full JavaScript
support, and nearly every operating system (including any PDA-like
mobile phone) comes with one standard. Users who are committed
to developing changes for a Gerrit managed project can be expected
to be able to run a JavaScript enabled browser, as they also would
need to be running Git in order to contribute.
There are a number of open source browsers available, including
Firefox and Chromium. Users have some degree of choice in their
browser selection, including being able to build and audit their
browser from source.
The majority of the content stored within Gerrit is also available
through other means, such as gitweb or the `git://` protocol.
Any existing search engine spider can crawl the server-side HTML
produced by gitweb, and thus can index the majority of the changes
which might appear in Gerrit. Some engines may even choose to
crawl the native version control database, such as ohloh.net does.
Therefore the lack of support for most search engine spiders is a
non-issue for most Gerrit deployments.
== Product Integration
Gerrit integrates with an existing gitweb installation by optionally
creating hyperlinks to reference changes on the gitweb server.
Gerrit integrates with an existing git-daemon installation by
optionally displaying `git://` URLs for users to download a
change through the native Git protocol.
Gerrit integrates with any OpenID provider for user authentication,
making it easier for users to join a Gerrit site and manage their
authentication credentials to it. To make use of Google Accounts
as an OpenID provider easier, Gerrit has a shorthand "Sign in with
a Google Account" link on its sign-in screen. Gerrit also supports
a shorthand sign in link for Yahoo!. Other providers may also be
supported more directly in the future.
Site administrators may limit the range of OpenID providers to
a subset of "reliable providers". Users may continue to use
any OpenID provider to publish comments, but granted privileges
are only available to a user if the only entry point to their
account is through the defined set of "reliable OpenID providers".
This permits site administrators to require HTTPS for OpenID,
and to use only large main-stream providers that are trustworthy,
or to require users to only use a custom OpenID provider installed
alongside Gerrit Code Review.
Gerrit integrates with some types of corporate single-sign-on (SSO)
solutions, typically by having the SSO authentication be performed
in a reverse proxy web server and then blindly trusting that all
incoming connections have been authenticated by that reverse proxy.
When configured to use this form of authentication, Gerrit does
not integrate with OpenID providers.
When installing Gerrit, administrators may optionally include an
HTML header or footer snippet which may include user tracking code,
such as that used by Google Analytics. This is a per-instance
configuration that must be done by hand, and is not supported
out of the box. Other site trackers instead of Google Analytics
can be used, as the administrator can supply any HTML/JavaScript
they choose.
Gerrit does not integrate with any Google service, or any other
services other than those listed above.
== Privacy Considerations
Gerrit stores the following information per user account:
* Full Name
* Preferred Email Address
* Mailing Address '(Optional, Encrypted)'
* Country '(Optional, Encrypted)'
* Phone Number '(Optional, Encrypted)'
* Fax Number '(Optional, Encrypted)'
The full name and preferred email address fields are shown to any
site visitor viewing a page containing a change uploaded by the
account owner, or containing a published comment written by the
account owner.
Showing the full name and preferred email is approximately the same
risk as the `From` header of an email posted to a public mailing
list that maintains archives, and Gerrit treats these fields in
much the same way that a mailing list archive might handle them.
Users who don't want to expose this information should either not
participate in a Gerrit based online community, or open a new email
address dedicated for this use.
As the Gerrit UI data is only available through XSRF protected
JSON-RPC calls, "screen-scraping" for email addresses is difficult,
but not impossible. It is unlikely a spammer will go through the
effort required to code a custom scraping application necessary
to cull email addresses from published Gerrit comments. In most
cases these same addresses would be more easily obtained from the
project's mailing list archives.
The user's name and email address is stored unencrypted in the
link:config-accounts.html#all-users[All-Users] repository.
The snail-mail mailing address, country, and phone and fax numbers
are gathered to help project leads contact the user should there
be a legal question regarding any change they have uploaded.
These sensitive fields are immediately encrypted upon receipt with
a GnuPG public key, and stored "off site" in another data store,
isolated from the main Gerrit change data. Gerrit does not have
access to the matching private key, and as such cannot decrypt the
information. Therefore these fields are write-once in Gerrit, as not
even the account owner can recover the values they previously stored.
It is expected that the address information would only need to be
decrypted and revealed with a valid court subpoena, but this is
really left to the discretion of the Gerrit site administrator as
to when it is reasonable to reveal this information to a 3rd party.
== Spam and Abuse Considerations
Gerrit makes no attempt to detect spam changes or comments. The
somewhat high barrier to entry makes it unlikely that a spammer
will target Gerrit.
To upload a change, the client must speak the native Git protocol
embedded in SSH, with some custom Gerrit semantics added on top.
The client must have their public key already stored in the Gerrit
database, which can only be done through the XSRF protected
JSON-RPC interface. The level of effort required to construct
the necessary tools to upload a well-formatted change that isn't
rejected outright by the Git and Gerrit checksum validations is
too high to for a spammer to get any meaningful return.
To post and publish a comment a client must sign in with an OpenID
provider and then use the XSRF protected JSON-RPC interface to
publish the draft on an existing change record. Again, the level of
effort required to implement the Gerrit specific XSRF protections
and the JSON-RPC payload format necessary to post a draft and then
publish that draft is simply too high for a spammer to bother with.
Both of these assumptions are also based upon the idea that Gerrit
will be a lot less popular than blog software, and thus will be
running on a lot fewer websites. Spammers therefore have very little
returned benefit for getting over the protocol hurdles.
These assumptions may need to be revisited in the future if any
public Gerrit site actually notices spam.
== Latency
Gerrit targets for sub-250 ms per page request, mostly by using
very compact JSON payloads between client and server. However, as
most of the serving stack (network, hardware, metadata
database) is out of control of the Gerrit developers, no real
guarantees can be made about latency.
== Scalability
Gerrit is designed for a very large scale open source project, or
large commercial development project. Roughly this amounts to
parameters such as the following:
.Design Parameters
[options="header"]
|======================================================
|Parameter | Default Maximum | Estimated Maximum
|Projects | 1,000 | 10,000
|Contributors | 1,000 | 50,000
|Changes/Day | 100 | 2,000
|Revisions/Change | 20 | 20
|Files/Change | 50 | 16,000
|Comments/File | 100 | 100
|Reviewers/Change | 8 | 8
|======================================================
Out of the box, Gerrit will handle the "Default Maximum". Site
administrators may reconfigure their servers by editing gerrit.config
to run closer to the estimated maximum if sufficient memory is made
available to the JVM and the relevant cache.*.memoryLimit variables
are increased from their defaults.
=== Discussion
Very few, if any open source projects have more than a handful of
Git repositories associated with them. Since Gerrit treats each
Git repository as a project, an upper limit of 10,000 projects
is reasonable. If a site has more than 1,000 projects, administrators
should increase
link:config-gerrit.html#cache.name.memoryLimit[`cache.projects.memoryLimit`]
to match.
Almost no open source project has 1,000 contributors over all time,
let alone on a daily basis. This default figure of 1,000 was WAG'd by
looking at PR statements published by cell phone companies picking
up the Android operating system. If all of the stated employees in
those PR statements were working on *only* the open source Android
repositories, we might reach the 1,000 estimate listed here. Knowing
these companies as being very closed-source minded in the past, it
is very unlikely all of their Android engineers will be working on
the open source repository, and thus 1,000 is a very high estimate.
The upper maximum of 50,000 contributors is based on existing
installations that are already handling quite a bit more than the
default maximum of 1,000 contributors. Given how the user data is
stored and indexed, supporting 50,000 contributor accounts (or more)
is easily possible for a server. If a server has more than 1,000
*active* contributors,
link:config-gerrit.html#cache.name.memoryLimit[`cache.accounts.memoryLimit`]
should be increased by the site administrator, if sufficient RAM
is available to the host JVM.
The estimate of 100 changes per day was WAG'd off some estimates
originally obtained from Android's development history. Writing a
good change that will be accepted through a peer-review process
takes time. The average engineer may need 4-6 hours per change just
to write the code and unit tests. Proper design consideration and
additional but equally important tasks such as meetings, interviews,
training, and eating lunch will often pad the engineer's day out
such that suitable changes are only posted once a day, or once
every other day. For reference, the entire Linux kernel has an
average of only 79 changes/day. If more than 100 changes are active
per day, site administrators should consider increasing the
link:config-gerrit.html#cache.name.memoryLimit[`cache.diff.memoryLimit`]
and `cache.diff_intraline.memoryLimit`.
On average any given change will need to be modified once to address
peer review comments before the final revision can be accepted by the
project. Executing these revisions also eats into the contributor's
time, and is another factor limiting the number of changes/day
accepted by the Gerrit instance. However, even though this implies
only 2 revisions/change, many existing Gerrit installations have seen
20 or more revisions/change, when new contributors are learning the
project's style and conventions.
On average, each change will have 2 reviewers, a human and an
automated test bed system. Usually this would be the project lead, or
someone who is familiar with the code being modified. The time
required to comment further reduces the time available for writing
one's own changes. However, existing Gerrit installations have seen 8
or more reviewers frequently show up on changes that impact many
functional areas, and therefore it is reasonable to expect 8 or more
reviewers to be able to work together on a single change.
Existing installations have successfully processed change reviews with
more than 16,000 files per change. However, since 16,000 modified/new
files is a massive amount of code to review, it is more typical to see
less than 10 files modified in any single change. Changes larger than
10 files are typically merges, for example integrating the latest
version of an upstream library, where the reviewer has little to do
beyond verifying the project compiles and passes a test suite.
=== CPU Usage - Web UI
Gerrit's web UI would require on average `4+F+F*C` HTTP requests to
review a change and post comments. Here `F` is the number of files
modified by the change, and `C` is the number of inline/file comments
left by the reviewer per file. The constant 4 accounts for the request
to load the reviewer's dashboard, to load the change detail page,
to publish the review comments, and to reload the change detail
page after comments are published.
This WAG'd estimate boils down to 216,000 HTTP requests per day
(QPD). Assuming these are evenly distributed over an 8 hour work day
in a single time zone, we are looking at approximately 7.5 queries
per second (QPS).
----
QPD = Changes_Day * Revisions_Change * Reviewers_Change * (4 + F + F * C)
= 2,000 * 2 * 1 * (4 + 10 + 10 * 4)
= 216,000
QPS = QPD / 8_Hours / 60_Minutes / 60_Seconds
= 7.5
----
Gerrit serves most requests in under 60 ms when using the loopback
interface and a single processor. On a single CPU system there is
sufficient capacity for 16 QPS. A dual processor system should be
more than sufficient for a site with the estimated load described above.
Given a more realistic estimate of 79 changes per day (from the
Linux kernel) suggests only 8,532 queries per day, and a much lower
0.29 QPS when spread out over an 8 hour work day.
=== CPU Usage - Git over SSH/HTTP
A 24 core server is able to handle ~25 concurrent `git fetch`
operations per second. The issue here is each concurrent operation
demands one full core, as the computation is almost entirely server
side CPU bound. 25 concurrent operations is known to be sufficient to
support hundreds of active developers and 50 automated build servers
polling for updates and building every change. (This data was derived
from an actual installation's performance.)
Because of the distributed nature of Git, end-users don't need to
contact the central Gerrit Code Review server very often. For `git
fetch` traffic, link:pgm-daemon.html[replica mode] is known to be an
effective way to offload traffic from the main server, permitting it
to scale to a large user base without needing an excessive number of
cores in a single system.
Clients on very slow network connections (for example home office
users on VPN over home DSL) may be network bound rather than server
side CPU bound, in which case a core may be effectively shared with
another user. Possible core sharing due to network bottlenecks
generally holds true for network connections running below 10 MiB/sec.
If the server's own network interface is 1 Gib/sec (Gigabit Ethernet),
the system can really only serve about 10 concurrent clients at the
10 MiB/sec speed, no matter how many cores it has.
=== Disk Usage
The average size of a revision in the Linux kernel once compressed by
Git is 2,327 bytes, or roughly 2 KiB. Over the course of a year a
Gerrit server running with the estimated maximum parameters above might
see an introduction of 1.4 GiB over the total set of 10,000 projects
hosted in that server. This figure assumes the majority of the content
is human written source code, and not large binary blobs such as disk
images or media files.
Production Gerrit installations have been tested, and are known to
handle Git repositories in the multigigabyte range, storing binary
files, ranging in size from a few kilobytes (for example compressed
icons) to 800+ megabytes (firmware images, large uncompressed original
artwork files). Best practices encourage breaking very large binary
files into their Git repositories based on access, to prevent desktop
clients from needing to clone unnecessary materials (for example a C
developer does not need every 800+ megabyte firmware image created by
the product's quality assurance team).
== Redundancy & Reliability
Gerrit largely assumes that the local filesystem where Git repository
data is stored is always available. Important data written to disk
is also forced to the platter with an `fsync()` once it has been
fully written. If the local filesystem fails to respond to reads
or becomes corrupt, Gerrit has no provisions to fallback or retry
and errors will be returned to clients.
Gerrit largely assumes that the metadata database is online and
answering both read and write queries. Query failures immediately
result in the operation aborting and errors being returned to the
client, with no retry or fallback provisions.
Due to the relatively small scale described above, it is very likely
that the Git filesystem and metadata database are all housed on the
same server that is running Gerrit. If any failure arises in one of
these components, it is likely to manifest in the others too. It is
also likely that the administrator cannot be bothered to deploy a
cluster of load-balanced server hardware, as the scale and expected
load does not justify the hardware or management costs.
Most deployments caring about reliability will setup a warm-spare
standby system and use a manual fail-over process to switch from the
failed system to the warm-spare.
As Git is a distributed version control system, and open source
projects tend to have contributors from all over the world, most
contributors will be able to tolerate a Gerrit down time of several
hours while the administrator is notified, signs on, and brings the
warm-spare up. Pending changes are likely to need at least 24 hours
of time on the Gerrit site anyway in order to ensure any interested
parties around the world have had a chance to comment. This expected
lag largely allows for some downtime in a disaster scenario.
=== Backups
Using the standard replication plugin, Gerrit can be configured
to replicate changes made to the local Git repositories over any
standard Git transports. After the plugin is installed, remote
destinations can be configured in `'$site_path'/etc/replication.conf`
to send copies of all changes over SSH to other servers, or to the
Amazon S3 blob storage service.
== Logging Plan
Gerrit does not maintain logs on its own.
Published comments contain a publication date, so users can judge
when the comment was posted and decide if it was "recent" or not.
Only the timestamp is stored in the database, the IP address of
the comment author is not stored.
Changes uploaded over the SSH daemon from `git push` have the
standard Git reflog updated with the date and time that the upload
occurred, and the Gerrit account identity of who did the upload.
Changes submitted and merged into a branch also update the
Git reflog. These logs are available only to the Gerrit site
administrator, and they are not replicated through the automatic
replication noted earlier. These logs are primarily recorded for an
"oh s**t" moment where the administrator has to rewind data. In most
installations they are a waste of disk space. Future versions of
JGit may allow disabling these logs, and Gerrit may take advantage
of that feature to stop writing these logs.
A web server positioned in front of Gerrit (such as a reverse proxy)
or the hosting servlet container may record access logs, and these
logs may be mined for usage information. This is outside of the
scope of Gerrit.
GERRIT
------
Part of link:index.html[Gerrit Code Review]
SEARCHBOX
---------