blob: c2c8b4c245e2fa059f90325d12f81fccc700ca6a [file] [log] [blame]
<html>
<head>
<title>Git on DHT Schema</title>
<style type='text/css'>
body { font-size: 10pt; }
h1 { font-size: 16pt; }
h2 { font-size: 12pt; }
h3 { font-size: 10pt; }
body {
margin-left: 8em;
margin-right: 8em;
}
h1 { margin-left: -3em; }
h2 { margin-left: -2em; }
h3 { margin-left: -1em; }
hr { margin-left: -4em; margin-right: -4em; }
.coltoc {
font-size: 8pt;
font-family: monospace;
}
.rowkey {
margin-left: 1em;
padding-top: 0.2em;
padding-left: 1em;
padding-right: 1em;
width: 54em;
border: 1px dotted red;
background-color: #efefef;
white-space: nowrap;
}
.rowkey .header {
font-weight: bold;
padding-right: 1em;
}
.rowkey .var {
font-style: italic;
font-family: monospace;
}
.rowkey .lit {
font-weight: bold;
font-family: monospace;
}
.rowkey .example {
font-family: monospace;
}
.rowkey p {
white-space: normal;
}
.colfamily {
margin-top: 0.5em;
padding-top: 0.2em;
padding-left: 1em;
padding-right: 1em;
width: 55em;
border: 1px dotted blue;
background-color: #efefef;
white-space: nowrap;
}
.colfamily .header {
font-weight: bold;
padding-right: 1em;
}
.colfamily .var {
font-style: italic;
font-family: monospace;
}
.colfamily .lit {
font-family: monospace;
}
.colfamily .example {
font-family: monospace;
}
.colfamily p {
white-space: normal;
}
.summary_table {
border-collapse: collapse;
border-spacing: 0;
}
.summary_table .desc {
font-size: 8pt;
white-space: nowrap;
text-align: right;
width: 20em;
}
.summary_table td {
border: 1px dotted lightgray;
padding-top: 2px;
padding-bottom: 2px;
padding-left: 5px;
padding-right: 5px;
vertical-align: top;
}
.summary_table tr.no_border td {
border: none;
}
</style>
</head>
<body>
<h1>Git on DHT Schema</h1>
<p>Storing Git repositories on a Distributed Hash Table (DHT) may
improve scaling for large traffic, but also simplifies management when
there are many small repositories.</p>
<h2>Table of Contents</h2>
<ul>
<li><a href="#concepts">Concepts</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#security">Data Security</a></li>
<li>Tables:
<ul>
<li><a href="#REPOSITORY_INDEX">Table REPOSITORY_INDEX</a>
(
<a href="#REPOSITORY_INDEX.id" class="toccol">id</a>
)</li>
<li><a href="#REPOSITORY">Table REPOSITORY</a>
(
<a href="#REPOSITORY.chunk-info" class="toccol">chunk-info</a>,
<a href="#REPOSITORY.cached-pack" class="toccol">cached-pack</a>
)</li>
<li><a href="#REF">Table REF</a>
(
<a href="#REF.target" class="toccol">target</a>
)</li>
<li><a href="#OBJECT_INDEX">Table OBJECT_INDEX</a>
(
<a href="#OBJECT_INDEX.info" class="toccol">info</a>
)</li>
<li><a href="#CHUNK">Table CHUNK</a>
(
<a href="#CHUNK.chunk" class="toccol">chunk</a>,
<a href="#CHUNK.index" class="toccol">index</a>,
<a href="#CHUNK.meta" class="toccol">meta</a>
)</li>
</ul>
</li>
<li>Protocol Messages:
<ul>
<li><a href="#message_RefData">RefData</a></li>
<li><a href="#message_ObjectInfo">ObjectInfo</a></li>
<li><a href="#message_ChunkInfo">ChunkInfo</a></li>
<li><a href="#message_ChunkMeta">ChunkMeta</a></li>
<li><a href="#message_CachedPackInfo">CachedPackInfo</a></li>
</ul>
</li>
</ul>
<a name="concepts"><h2>Concepts</h2></a>
<p><i>Git Repository</i>: Stores the version control history for a
single project. Each repository is a directed acyclic graph (DAG)
composed of objects. Revision history for a project is described by a
commit object pointing to the complete set of files that make up that
version of the project, and a pointer to the commit that came
immediately before it. Repositories also contain references,
associating a human readable branch or tag name to a specific commit
object. Tommi Virtanen has a
<a href="http://eagain.net/articles/git-for-computer-scientists/">more
detailed description of the Git DAG</a>.</p>
<p><i>Object</i>: Git objects are named by applying the SHA-1 hash
algorithm to their contents. There are 4 object types: commit, tree,
blob, tag. Objects are typically stored deflated using libz deflate,
but may also be delta compressed against another similar object,
further reducing the storage required. The big factor for Git
repository size is usually object count, e.g. the linux-2.6 repository
contains 1.8 million objects.</p>
<p><i>Reference</i>: Associates a human readable symbolic name, such
as <code>refs/heads/master</code> to a specific Git object, usually a
commit or tag. References are updated to point to the most recent
object whenever changes are committed to the repository.</p>
<p><i>Git Pack File</i>: A container stream holding many objects in a
highly compressed format. On the local filesystem, Git uses pack files
to reduce both inode and space usage by combining millions of objects
into a single data stream. On the network, Git uses pack files as the
basic network protocol to transport objects from one system's
repository to another.</p>
<p><i>Garbage Collection</i>: Scanning the Git object graph to locate
objects that are reachable, and others that are unreachable. Git also
generally performs data recompression during this task to produce more
optimal deltas between objects, reducing overall disk usage and data
transfer sizes. This is independent of any GC that may be performed by
the DHT to clean up old cells.</p>
<p>The basic storage strategy employed by this schema is to break a
standard Git pack file into chunks, approximately 1 MiB in size. Each
chunk is stored as one row in the <a href="#CHUNK">CHUNK</a> table.
During reading, chunks are paged into the application on demand, but
may also be prefetched using prefetch hints. Rules are used to break
the standard pack into chunks, these rules help to improve reference
locality and reduce the number of chunk loads required to service
common operations. In a nutshell, the DHT is used as a virtual memory
system for pages about 1 MiB in size.</p>
<a name="summary"><h2>Summary</h2></a>
<p>The schema consists of a handful of tables. Size estimates are
given for one copy of the linux-2.6 Git repository, a relative tortue
test case that contains 1.8 million objects and is 425 MiB when stored
on the local filesystem. All sizes are before any replication made by
the DHT, or its underlying storage system.</p>
<table style='margin-left: 2em' class='summary_table'>
<tr>
<th>Table</th>
<th>Rows</th>
<th>Cells/Row</th>
<th>Bytes</th>
<th>Bytes/Row</th>
</tr>
<tr>
<td><a href="#REPOSITORY_INDEX">REPOSITORY_INDEX</a>
<div class='desc'>Map host+path to surrogate key.</div></td>
<td align='right'>1</td>
<td align='right'>1</td>
<td align='right'>&lt; 100 bytes</td>
<td align='right'>&lt; 100 bytes</td>
</tr>
<tr>
<td><a href="#REPOSITORY">REPOSITORY</a>
<div class='desc'>Accounting and replica management.</div></td>
<td align='right'>1</td>
<td align='right'>403</td>
<td align='right'>65 KiB</td>
<td align='right'>65 KiB</td>
</tr>
<tr>
<td><a href="#REF">REF</a>
<div class='desc'>Bind branch/tag name to Git object.</div></td>
<td align='right'>211</td>
<td align='right'>1</td>
<td align='right'>14 KiB</td>
<td align='right'>67 bytes</td>
</tr>
<tr>
<td><a href="#OBJECT_INDEX">OBJECT_INDEX</a>
<div class='desc'>Locate Git object by SHA-1 name.</div></td>
<td align='right'>1,861,833</td>
<td align='right'>1</td>
<td align='right'>154 MiB</td>
<td align='right'>87 bytes</td>
</tr>
<tr>
<td><a href="#CHUNK">CHUNK</a>
<div class='desc'>Complete Git object storage.</div></td>
<td align='right'>402</td>
<td align='right'>3</td>
<td align='right'>417 MiB</td>
<td align='right'>~ 1 MiB</td>
</tr>
<tr class='no_border'>
<td align='right'><i>Total</i></td>
<td align='right'>1,862,448</td>
<td align='right'></td>
<td align='right'>571 MiB</td>
<td align='right'></td>
</tr>
</table>
<a name="security"><h2>Data Security</h2></a>
<p>If data encryption is necessary to protect file contents, the <a
href="#CHUNK.chunk">CHUNK.chunk</a> column can be encrypted with a
block cipher such as AES. This column contains the revision commit
messages, file paths, and file contents. By encrypting one column, the
majority of the repository data is secured. As each cell value is
about 1 MiB and contains a trailing 4 bytes of random data, an ECB
mode of operation may be sufficient. Because the cells are already
very highly compressed using the Git data compression algorithms,
there is no increase in disk usage due to encryption.</p>
<p>Branch and tag names (<a href="#REF">REF</a> row keys) are not
encrypted. If these need to be secured the portion after the ':' would
need to be encrypted with a block cipher. However these strings are
very short and very common (HEAD, refs/heads/master, refs/tags/v1.0),
making encryption difficult. A variation on the schema might move all
rows for a repository into a single protocol messsage, then encrypt
the protobuf into a single cell. Unfortunately this strategy has a
high update cost, and references change frequently.</p>
<p>Object SHA-1 names (<a href="#OBJECT_INDEX">OBJECT_INDEX</a> row
keys and <a href="#CHUNK.index">CHUNK.index</a> values) are not
encrypted. This allows a reader to determine if a repository contains
a specific revision, but does not allow them to inspect the contents
of the revision. The CHUNK.index column could also be encrypted with a
block cipher when CHUNK.chunk is encrypted (see above), however the
OBJECT_INDEX table row keys cannot be encrypted if abbrevation
expansion is to be supported for end-users of the repository. The row
keys must be unencrypted as abbreviation resolution is performed by a
prefix range scan over the keys.</p>
<p>The remaining tables and columns contain only statistics (e.g.
object counts or cell sizes), or internal surrogate keys
(repository_id, chunk_key) and do not require encryption.</p>
<hr />
<a name="REPOSITORY_INDEX"><h2>Table REPOSITORY_INDEX</h2></a>
<p>Maps a repository name, as presented in the URL by an end-user or
client application, into its internal repository_id value. This
mapping allows the repository name to be quickly modified (e.g.
renamed) without needing to update the larger data rows of the
repository.</p>
<p>The exact semantics of the repository_name format is left as a
deployment decision, but DNS hostname, '/', repository name would be
one common usage.</p>
<h3>Row Key</h3>
<div class='rowkey'>
<div>
<span class='header'>Row Key:</span>
<span class='var'>repository_name</span>
</div>
<p>Human readable name of the repository, typically derived from the
HTTP <code>Host</code> header and path in the URL.</p>
<p>Examples:</p>
<ul>
<li><span class='example'>com.example.git/pub/git/foo.git</span></li>
<li><span class='example'>com.example.git/~user/mystuff.git</span></li>
</ul>
</div>
<h3>Columns</h3>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="REPOSITORY_INDEX.id"><span class='lit'>id:</span></a>
</div>
<p>The repository_id, as an 8-digit hex ASCII string.</p>
</div>
<h3>Size Estimate</h3>
<p>Less than 100,000 rows. More likely estimate is 1,000 rows.
Total disk usage under 512 KiB, assuming 1,000 names and 256
characters per name.</p>
<h3>Updates</h3>
<p>Only on repository creation or rename, which is infrequent (&lt;10
rows/month). Updates are performed in a row-level transaction, to
ensure a name is either assigned uniquely, or fails.</p>
<h3>Reads</h3>
<p>Reads are tried first against memcache, then against the DHT if the
entry did not exist in memcache. Successful reads against the DHT are
put back into memcache in the background.</p>
<a name="REPOSITORY"><h2>Table REPOSITORY</h2></a>
<p>Tracks top-level information about each repository.</p>
<h3>Row Key</h3>
<div class='rowkey'>
<div>
<span class='header'>Row Key:</span>
<span class='var'>repository_id</span>
</div>
<p>The repository_id, as an 8-digit hex ASCII string.</p>
</div>
<p>Typically this is assigned sequentially, then has the bits reversed
to evenly spread repositories throughout the DHT. For example the
first repository is <code>80000000</code>, and the second is
<code>40000000</code>.</p>
<h3>Columns</h3>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="REPOSITORY.chunk-info"><span class='lit'>chunk-info:</span></a>
<span class='var'>chunk_key[short]</span>
</div>
<p>Cell value is the protocol message <a
href="#message_ChunkInfo">ChunkInfo</a> describing the chunk's
contents. Most of the message's fields are only useful for quota
accounting and reporting.</p>
</div>
<p>This column exists to track all of the chunks that make up a
repository's object set. Garbage collection and quota accounting tasks
can primarily drive off this column, rather than scanning the much
larger <a href="#CHUNK">CHUNK</a> table with a regular expression on
the chunk row key.</p>
<p>As each chunk averages 1 MiB in size, the linux-2.6 repository
(at 373 MiB) has about 400 chunks and thus about 400 chunk-info
cells. The chromium repository (at 1 GiB) has about 1000 chunk-info
cells. It would not be uncommon to have 2000 chunk-info cells.</p>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="REPOSITORY.cached-pack"><span class='lit'>cached-pack:</span></a>
<span class='var'>NNNNx38</span>
<span class='lit'>.</span>
<span class='var'>VVVVx38</span>
</div>
<p>Variables:</p>
<ul>
<li><span class='var'>NNNNx38</span> = 40 hex digit name of the cached pack</li>
<li><span class='var'>VVVVx38</span> = 40 hex digit version of the cached pack</li>
</ul>
<p>Examples:</p>
<ul>
<li><span class='example'>4e32fb97103981e7dd53dcc786640fa4fdb444b8.8975104a03d22e54f7060502e687599d1a2c2516</span></li>
</ul>
<p>Cell value is the protocol message <a
href="#message_CachedPackInfo">CachedPackInfo</a> describing the
chunks that make up a cached pack.</p>
</div>
<p>The <code>cached-pack</code> column family describes large lists of
chunks that when combined together in a specific order create a valid
Git pack file directly streamable to a client. This avoids needing to
enumerate and pack the entire repository on each request.</p>
<p>The cached-pack name (NNNNx38 above) is the SHA-1 of the objects
contained within the pack, in binary, sorted. This is the standard
naming convention for pack files on the local filesystem. The version
(VVVVx38 above) is the SHA-1 of the chunk keys, sorted. The version
makes the cached-pack cell unique, if any single bit in the compressed
data is modified a different version will be generated, and a
different cell will be used to describe the alternate version of the
same data. The version is necessary to prevent repacks of the same
object set (but with different compression settings or results) from
stepping on active readers.</p>
<h2>Size Estimate</h2>
<p>1 row per repository (~1,000 repositories), however the majority of
the storage cost is in the <code>chunk-info</code> column family,
which can have more than 2000 cells per repository.</p>
<p>Each <code>chunk-info</code> cell is on average 147 bytes. For a
large repository like chromium.git (over 1000 chunks) this is only 147
KiB for the entire row.</p>
<p>Each <code>cached-pack</code> cell is on average 5350 bytes. Most
repositories have 1 of these cells, 2 while the repository is being
repacked on the server side to update the cached-pack data.</p>
<h2>Updates</h2>
<p>Information about each ~1 MiB chunk of pack data received over the
network is stored as a unique column in the <code>chunk-info</code>
column family.</p>
<p>Most pushes are at least 2 chunks (commit, tree), with 50 pushes
per repository per day being possible (50,000 new cells/day).</p>
<p><b>TODO:</b> Average push rates?</p>
<h2>Reads</h2>
<p><i>Serving clients:</i> Read all cells in the
<code>cached-pack</code> column family, typically only 1-5 cells. The
cells are cached in memcache and read from there first.</p>
<p><i>Garbage collection:</i> Read all cells in the
<code>chunk-info</code> column family to determine which chunks are
owned by this repository, without scanning the <a href="#CHUNK">CHUNK</a> table.
Delete <code>chunk-info</code> after the corresponding <a href="#CHUNK">CHUNK</a>
row has been deleted. Unchanged chunks have their info left alone.</p>
<a name="REF"><h2>Table REF</h2></a>
<p>Associates a human readable branch (e.g.
<code>refs/heads/master</code>) or tag (e.g.
<code>refs/tags/v1.0</code>) name to the Git
object that represents that current state of
the repository.</p>
<h3>Row Key</h3>
<div class='rowkey'>
<div>
<span class='header'>Row Key:</span>
<span class='var'>repository_id</span>
<span class='lit'>:</span>
<span class='var'>ref_name</span>
</div>
<p>Variables:</p>
<ul>
<li><span class='var'>repository_id</span> = Repository owning the reference (see above)</li>
<li><span class='var'>ref_name</span> = Name of the reference, UTF-8 string</li>
</ul>
<p>Examples:</p>
<ul>
<li><span class='example'>80000000:HEAD</span></li>
<li><span class='example'>80000000:refs/heads/master</span></li>
<br />
<li><span class='example'>40000000:HEAD</span></li>
<li><span class='example'>40000000:refs/heads/master</span></li>
</ul>
</div>
<p>The separator <code>:</code> used in the row key was chosen because
this character is not permitted in a Git reference name.</p>
<h3>Columns</h3>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="REF.target"><span class='lit'>target:</span></a>
</div>
<p>Cell value is the protocol message
<a href="#message_RefData">RefData</a> describing the
current SHA-1 the reference points to, and the chunk
it was last observed in. The chunk hint allows skipping
a read of <a href="#OBJECT_INDEX">OBJECT_INDEX</a>.</p>
<p>Several versions (5) are stored for emergency rollbacks.
Additional versions beyond 5 are cleaned up during table garbage
collection as managed by the DHT's cell GC.</p>
</div>
<h3>Size Estimate</h3>
<p><i>Normal Git usage:</i> ~10 branches per repository, ~200 tags.
For 1,000 repositories, about 200,000 rows total. Average row size is
about 240 bytes/row before compression (67 after), or 48M total.</p>
<p><i>Gerrit Code Review usage:</i> More than 300 new rows per day.
Each snapshot of each change under review is one reference.</p>
<h3>Updates</h3>
<p>Writes are performed by doing an atomic compare-and-swap (through a
transaction), changing the RefData protocol buffer.</p>
<h3>Reads</h3>
<p>Reads perform prefix scan for all rows starting with
<code>repository_id:</code>. Plans exist to cache these reads within a
custom service, avoiding most DHT queries.</p>
<a name="OBJECT_INDEX"><h2>Table OBJECT_INDEX</h2></a>
<p>The Git network protocol has clients sending object SHA-1s to the
server, with no additional context or information. End-users may also
type a SHA-1 into a web search box. This table provides a mapping of
the object SHA-1 to which chunk(s) store the object's data. The table
is sometimes also called the 'global index', since it names where
every single object is stored.</p>
<h3>Row Key</h3>
<div class='rowkey'>
<div>
<span class='header'>Row Key:</span>
<span class='var'>NN</span>
<span class='lit'>.</span>
<span class='var'>repository_id</span>
<span class='lit'>.</span>
<span class='var'>NNx40</span>
</div>
<p>Variables:</p>
<ul>
<li><span class='var'>NN</span> = First 2 hex digits of object SHA-1</li>
<li><span class='var'>repository_id</span> = Repository owning the object (see above)</li>
<li><span class='var'>NNx40</span> = Complete object SHA-1 name, in hex</li>
</ul>
<p>Examples:</p>
<ul>
<li><span class='example'>2b.80000000.2b5c9037c81c38b3b9abc29a3a87a4abcd665ed4</span></li>
<li><span class='example'>8f.40000000.8f270a441569b127cc4af8a6ef601d94d9490efb</span></li>
</ul>
</div>
<p>The first 2 hex digits (<code>NN</code>) distribute object keys
within the same repository around the DHT keyspace, preventing a busy
repository from creating too much of a hot-spot within the DHT. To
simplify key generation, these 2 digits are repeated after the
repository_id, as part of the 40 hex digit object name.</p>
<p>Keys must be clustered by repository_id to support extending
abbreviations. End-users may supply an abbreviated SHA-1 of 4 or more
digits (up to 39) and ask the server to complete them to a full 40
digit SHA-1 if the server has the relevant object within the
repository's object set.</p>
<p>A schema variant that did not include the repository_id as part of
the row key was considered, but discarded because completing a short
4-6 digit abbreviated SHA-1 would be impractical once there were
billions of objects stored in the DHT. Git end-users expect to be able
to use 4 or 6 digit abbreviations on very small repositories, as the
number of objects is low and thus the number of bits required to
uniquely name an object within that object set is small.</p>
<h3>Columns</h3>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="OBJECT_INDEX.info"><span class='lit'>info:</span></a>
<span class='var'>chunk_key[short]</span>
</div>
<p>Cell value is the protocol message
<a href="#message_ObjectInfo">ObjectInfo</a> describing how the object
named by the row key is stored in the chunk named by the column name.</p>
<p>Cell timestamp matters. The <b>oldest cell</b> within the
entire column family is favored during reads. As chunk_key is
unique, versions within a single column aren't relevant.</p>
</div>
<h3>Size Estimate</h3>
<p>Average row size per object/chunk pair is 144 bytes uncompressed
(87 compressed), based on the linux-2.6 repository. The linux-2.6
repository has 1.8 million objects, and is growing at a rate of about
300,000 objects/year. Total usage for linux-2.6 is above 154M.</p>
<p>Most rows contain only 1 cell, as the object appears in only 1
chunk within that repository.</p>
<p><i>Worst case:</i> 1.8 million rows/repository * 1,000 repositories
is around 1.8 billion rows and 182G.</p>
<h3>Updates</h3>
<p>One write per object received over the network; typically performed
as part of an asynchronous batch. Each batch is sized around 512 KiB
(about 3000 rows). Because of SHA-1's uniform distribution, row keys
are first sorted and then batched into buckets of about 3000 rows. To
prevent too much activity from going to one table segment at a time
the complete object list is segmented into up to 32 groups which are
written in round-robin order.</p>
<p>A full push of the linux-2.6 repository writes 1.8 million
rows as there are 1.8 million objects in the pack stream.</p>
<p>During normal insert or receive operations, each received object is
a blind write to add one new <code>info:chunk_key[short]</code> cell
to the row. During repack, all cells in the <code>info</code> column
family are replaced with a single cell.</p>
<h3>Reads</h3>
<p>During common ancestor negotiation reads occur in batches of 64-128
full row keys, uniformly distributed throughout the key space. Most of
these reads are misses, the OBJECT_INDEX table does not contain the
key offered by the client. A successful negotation for most developers
requires at least two rounds of 64 objects back-to-back over HTTP. Due
to the high miss rate on this table, an in-memory bloom filter may be
important for performance.</p>
<p>To support the high read-rate (and high miss-rate) during common
ancestor negotation, an alternative to an in-memory bloom filter
within the DHT is to downoad the entire set of keys into an alternate
service job for recently accessed repositories. This service can only
be used if <i>all</i> of the keys for the same repository_id are
hosted within the service. Given this is under 36 GiB for the worst
case 1.8 billion rows mentioned above, this may be feasible. Loading
the table can be performed by fetching <a
href="#REPOSITORY.chunk-info">REPOSITORY.chunk-info</a> and then
performing parallel gets for the <a
href="#CHUNK.index">CHUNK.index</a> column, and scanning the local
indexes to construct the list of known objects.</p>
<p>During repacking with no delta reuse, worst case scenario requires
reading all records with the same repository_id (for linux-2.6 this
is 1.8 million rows). Reads are made in a configurable batch size,
right now this is set at 2048 keys/batch, with 4 concurrent batches in
flight at a time.</p>
<p>Reads are tried first against memcache, then against the DHT if the
entry did not exist in memcache. Successful reads against the DHT are
put back into memcache in the background.</p>
<a name="CHUNK"><h2>Table CHUNK</h2></a>
<p>Stores the object data for a repository, containing commit history,
directory structure, and file revisions. Each chunk is typically 1 MiB
in size, excluding the index and meta columns.</p>
<h3>Row Key</h3>
<div class='rowkey'>
<div>
<span class='header'>Row Key:</span>
<span class='var'>HH</span>
<span class='lit'>.</span>
<span class='var'>repository_id</span>
<span class='lit'>.</span>
<span class='var'>HHx40</span>
</div>
<p>Variables:</p>
<ul>
<li><span class='var'>HH</span> = First 2 hex digits of chunk SHA-1</li>
<li><span class='var'>repository_id</span> = Repository owning the chunk (see above)</li>
<li><span class='var'>HHx40</span> = Complete chunk SHA-1, in hex</li>
</ul>
<p>Examples:</p>
<ul>
<li><span class='example'>09.80000000.09e0eb57543be633b004b672cbebdf335aa4d53f</span> <i>(full key)</i></li>
</ul>
</div>
<p>Chunk keys are computed by first computing the SHA-1 of the
<code>chunk:</code> column, which is the compressed object contents
stored within the chunk. As the chunk data includes a 32 bit salt in
the trailing 4 bytes, this value is random even for the exact same
object input.</p>
<p>The leading 2 hex digit <code>HH</code> component distributes
chunks for the same repository (and over the same time period) evenly
around the DHT keyspace, preventing any portion from becoming too
hot.</p>
<h3>Columns</h3>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="CHUNK.chunk"><span class='lit'>chunk:</span></a>
</div>
<p>Multiple objects in Git pack-file format, about 1 MiB in size.
The data is already very highly compressed by Git and is not further
compressable by the DHT.</p>
</div>
<p>This column is essentially the standard Git pack-file format,
without the standard header or trailer. Objects can be stored in
either whole format (object content is simply deflated inline)
or in delta format (reference to a delta base is followed by
deflated sequence of copy and/or insert instructions to recreate
the object content). The OBJ_OFS_DELTA format is preferred
for deltas, since it tends to use a shorter encoding than the
OBJ_REF_DELTA format. Offsets beyond the start of the chunk are
actually offsets to other chunks, and must be resolved using the
<code>meta.base_chunk.relative_start</code> field.</p>
<p>Because the row key is derived from the SHA-1 of this column, the
trailing 4 bytes is randomly generated at insertion time, to make it
impractical for remote clients to predict the name of the chunk row.
This allows the stream parser to bindly insert rows without first
checking for row existance, or worry about replacing an existing
row and causing data corruption.</p>
<p>This column value is essentially opaque to the DHT.</p>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="CHUNK.index"><span class='lit'>index:</span></a>
</div>
<p>Binary searchable table listing object SHA-1 and starting offset
of that object within the <code>chunk:</code> data stream. The data
in this index is essentially random (due to the SHA-1s stored in
binary) and thus is not compressable.</p>
</div>
<p>Sorted list of SHA-1s of each object that is stored in this chunk,
along with the offset. This column allows efficient random access to
any object within the chunk, without needing to perform a remote read
against <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table. The column is
very useful at read time, where pointers within Git objects will
frequently reference other objects stored in the same chunk.</p>
<p>This column is sometimes called the local index, since it is local
only to the chunk and thus differs from the global index stored in the
<a href="#OBJECT_INDEX">OBJECT_INDEX</a> table.</p>
<p>The column size is 24 bytes per object stored in the chunk. Commit
chunks store on average 2200 commits/chunk, so a commit chunk index is
about 52,800 bytes.</p>
<p>This column value is essentially opaque to the DHT.</p>
<div class='colfamily'>
<div>
<span class='header'>Column:</span>
<a name="CHUNK.meta"><span class='lit'>meta:</span></a>
</div>
<p>Cell value is the protocol message
<a href="#message_ChunkMeta">ChunkMeta</a> describing prefetch
hints, object fragmentation, and delta base locations. Unlike
<code>chunk:</code> and <code>index:</code>, this column is
somewhat compressable.</p>
</div>
<p>The meta column provides information critical for reading the
chunk's data. (Unlike <a href="#message_ChunkInfo">ChunkInfo</a> in
the <a href="#REPOSITORY">REPOSITORY</a> table, which is used only for
accounting.)</p>
<p>The most important element is the BaseChunk nested message,
describing a chunk that contains a base object required to inflate
an object that is stored in this chunk as a delta.</p>
<h3>Chunk Contents</h3>
<p>Chunks try to store only a single object type, however mixed object
type chunks are supported. The rule to store only one object type per
chunk improves data locality, reducing the number of chunks that need
to be accessed from the DHT in order to perform a particular Git
operation. Clustering commits together into a 'commit chunk' improves
data locality during log/history walking operations, while clustering
trees together into a 'tree chunk' improves data locality during the
early stages of packing or difference generation.</p>
<p>Chunks reuse the standard Git pack data format to support direct
streaming of a chunk's <code>chunk:</code> column to clients, without
needing to perform any data manipulation on the server. This enables
high speed data transfer from the DHT to the client.</p>
<h3>Large Object Fragmentation</h3>
<p>If a chunk contains more than one object, all objects within the
chunk must store their complete compressed form within the chunk. This
limits an object to less than 1 MiB of compressed data.</p>
<p>Larger objects whose compressed size is bigger than 1 MiB are
fragmented into multiple chunks. The first chunk contains the object's
pack header, and the first 1 MiB of compressed data. Subsequent data
is stored in additional chunks. The additional chunk keys are stored
in the <code>meta.fragment</code> field. Each chunk that is part of
the same large object redundantly stores the exact same meta
value.</p>
<h3>Size Estimate</h3>
<p>Approximately the same size if the repository was stored on the
local filesystem. For the linux-2.6 repository (373M / 1.8 million
objects), about 417M (373.75M in <code>chunk:</code>, 42.64M in
<code>index:</code>, 656K in <code>meta:</code>).</p>
<p>Row count is close to size / 1M (373M / 1M = 373 rows), but may be
slightly higher (e.g. 402) due to fractional chunks on the end of
large fragmented objects, or where the single object type rule caused a
chunk to close before it was full.</p>
<p>For the complete Android repository set, disk usage is ~13G.</p>
<h3>Updates</h3>
<p>This table is (mostly) append-only. Write operations blast in ~1
MiB chunks, as the key format assures writers the new row does not
already exist. Chunks are randomly scattered by the hashing function,
and are not buffered very deep by writers.</p>
<p><i>Interactive writes:</i> Small operations impacting only 1-5
chunks will write all columns in a single operation. Most chunks of
this varity will be very small, 1-10 objects per chunk and about 1-10
KiB worth of compressed data inside of the <code>chunk:</code> column.
This class of write represents a single change made by one developer
that must be shared back out immediately.</p>
<p><i>Large pushes:</i> Large operations impacting tens to hundreds of
chunks will first write the <code>chunk:</code> column, then come back
later and populate the <code>index:</code> and <code>meta:</code>
columns once all chunks have been written. The delayed writing of
index and meta during large operations is required because the
information for these columns is not available until the entire data
stream from the Git client has been received and scanned. As the Git
server may not have sufficient memory to store all chunk data (373M or
1G!), its written out first to free up memory.</p>
<p><i>Garbage collection:</i> Chunks that are not optimally sized
(less than the target ~1 MiB), optimally localized (too many graph
pointers outside of the chunk), or compressed (Git found a smaller way
to store the same content) will be replaced by first writing new
chunks, and then deleting the old chunks.</p>
<p>Worst case, this could churn as many as 402 rows and 373M worth of
data for the linux-2.6 repository. Special consideration will be made
to try and avoid replacing chunks whose <code>WWWW</code> key
component is 'sufficiently old' and whose content is already
sufficiently sized and compressed. This will help to limit churn to
only more recently dated chunks, which are smaller in size.</p>
<h3>Reads</h3>
<p>All columns are read together as a unit. Memcache is checked first,
with reads falling back to the DHT if the cache does not have the
chunk.</p>
<p>Reasonably accurate prefetching is supported through background
threads and prefetching metadata embedded in the <a
href="#message_CachedPackInfo">CachedPackInfo</a> and <a
href="#message_ChunkMeta">ChunkMeta</a> protocol messages used by
readers.</p>
<hr />
<h2>Protocol Messages</h2>
<pre>
package git_store;
option java_package = "org.eclipse.jgit.storage.dht.proto";
// Entry in RefTable describing the target of the reference.
// Either symref *OR* target must be populated, but never both.
//
<a name="message_RefData">message RefData</a> {
// An ObjectId with an optional hint about where it can be found.
//
message Id {
required string object_name = 1;
optional string chunk_key = 2;
}
// Name of another reference this reference inherits its target
// from. The target is inherited on-the-fly at runtime by reading
// the other reference. Typically only "HEAD" uses symref.
//
optional string symref = 1;
// ObjectId this reference currently points at.
//
optional Id target = 2;
// True if the correct value for peeled is stored.
//
optional bool is_peeled = 3;
// If is_peeled is true, this field is accurate. This field
// exists only if target points to annotated tag object, then
// this field stores the "object" field for that tag.
//
optional Id peeled = 4;
}
// Entry in ObjectIndexTable, describes how an object appears in a chunk.
//
<a name="message_ObjectInfo">message ObjectInfo</a> {
// Type of Git object.
//
enum ObjectType {
COMMIT = 1;
TREE = 2;
BLOB = 3;
TAG = 4;
}
optional ObjectType object_type = 1;
// Position of the object's header within its chunk.
//
required int32 offset = 2;
// Total number of compressed data bytes, not including the pack
// header. For fragmented objects this is the sum of all chunks.
//
required int64 packed_size = 3;
// Total number of bytes of the uncompressed object. For a
// delta this is the size after applying the delta onto its base.
//
required int64 inflated_size = 4;
// ObjectId of the delta base, if this object is stored as a delta.
// The base is stored in raw binary.
//
optional bytes delta_base = 5;
}
// Describes at a high-level the information about a chunk.
// A repository can use this summary to determine how much
// data is stored, or when garbage collection should occur.
//
<a name="message_ChunkInfo">message ChunkInfo</a> {
// Source of the chunk (what code path created it).
//
enum Source {
RECEIVE = 1; // Came in over the network from external source.
INSERT = 2; // Created in this repository (e.g. a merge).
REPACK = 3; // Generated during a repack of this repository.
}
optional Source source = 1;
// Type of Git object stored in this chunk.
//
enum ObjectType {
MIXED = 0;
COMMIT = 1;
TREE = 2;
BLOB = 3;
TAG = 4;
}
optional ObjectType object_type = 2;
// True if this chunk is a member of a fragmented object.
//
optional bool is_fragment = 3;
// If present, key of the CachedPackInfo object
// that this chunk is a member of.
//
optional string cached_pack_key = 4;
// Summary description of the objects stored here.
//
message ObjectCounts {
// Number of objects stored in this chunk.
//
optional int32 total = 1;
// Number of objects stored in whole (non-delta) form.
//
optional int32 whole = 2;
// Number of objects stored in OFS_DELTA format.
// The delta base appears in the same chunk, or
// may appear in an earlier chunk through the
// ChunkMeta.base_chunk link.
//
optional int32 ofs_delta = 3;
// Number of objects stored in REF_DELTA format.
// The delta base is at an unknown location.
//
optional int32 ref_delta = 4;
}
optional ObjectCounts object_counts = 5;
// Size in bytes of the chunk's compressed data column.
//
optional int32 chunk_size = 6;
// Size in bytes of the chunk's index.
//
optional int32 index_size = 7;
// Size in bytes of the meta information.
//
optional int32 meta_size = 8;
}
// Describes meta information about a chunk, stored inline with it.
//
<a name="message_ChunkMeta">message ChunkMeta</a> {
// Enumerates the other chunks this chunk depends upon by OFS_DELTA.
// Entries are sorted by relative_start ascending, enabling search. Thus
// the earliest chunk is at the end of the list.
//
message BaseChunk {
// Bytes between start of the base chunk and start of this chunk.
// Although the value is positive, its a negative offset.
//
required int64 relative_start = 1;
required string chunk_key = 2;
}
repeated BaseChunk base_chunk = 1;
// If this chunk is part of a fragment, key of every chunk that
// makes up the fragment, including this chunk.
//
repeated string fragment = 2;
// Chunks that should be prefetched if reading the current chunk.
//
message PrefetchHint {
repeated string edge = 1;
repeated string sequential = 2;
}
optional PrefetchHint commit_prefetch = 51;
optional PrefetchHint tree_prefetch = 52;
}
// Describes a CachedPack, for efficient bulk clones.
//
<a name="message_CachedPackInfo">message CachedPackInfo</a> {
// Unique name of the cached pack. This is the SHA-1 hash of
// all of the objects that make up the cached pack, sorted and
// in binary form. (Same rules as Git on the filesystem.)
//
required string name = 1;
// SHA-1 of all chunk keys, which are themselves SHA-1s of the
// raw chunk data. If any bit differs in compression (due to
// repacking) the version will differ.
//
required string version = 2;
// Total number of objects in the cached pack. This must be known
// in order to set the final resulting pack header correctly before it
// is sent to clients.
//
required int64 objects_total = 3;
// Number of objects stored as deltas, rather than deflated whole.
//
optional int64 objects_delta = 4;
// Total size of the chunks, in bytes, not including the chunk footer.
//
optional int64 bytes_total = 5;
// Objects this pack starts from.
//
message TipObjectList {
repeated string object_name = 1;
}
required TipObjectList tip_list = 6;
// Chunks, in order of occurrence in the stream.
//
message ChunkList {
repeated string chunk_key = 1;
}
required ChunkList chunk_list = 7;
}
</pre>
</body>
</html>