| <html> |
| <head> |
| <title>Git on DHT Schema</title> |
| |
| <style type='text/css'> |
| body { font-size: 10pt; } |
| h1 { font-size: 16pt; } |
| h2 { font-size: 12pt; } |
| h3 { font-size: 10pt; } |
| |
| body { |
| margin-left: 8em; |
| margin-right: 8em; |
| } |
| h1 { margin-left: -3em; } |
| h2 { margin-left: -2em; } |
| h3 { margin-left: -1em; } |
| hr { margin-left: -4em; margin-right: -4em; } |
| |
| .coltoc { |
| font-size: 8pt; |
| font-family: monospace; |
| } |
| |
| .rowkey { |
| margin-left: 1em; |
| padding-top: 0.2em; |
| padding-left: 1em; |
| padding-right: 1em; |
| width: 54em; |
| border: 1px dotted red; |
| background-color: #efefef; |
| white-space: nowrap; |
| } |
| .rowkey .header { |
| font-weight: bold; |
| padding-right: 1em; |
| } |
| .rowkey .var { |
| font-style: italic; |
| font-family: monospace; |
| } |
| .rowkey .lit { |
| font-weight: bold; |
| font-family: monospace; |
| } |
| .rowkey .example { |
| font-family: monospace; |
| } |
| .rowkey p { |
| white-space: normal; |
| } |
| |
| .colfamily { |
| margin-top: 0.5em; |
| padding-top: 0.2em; |
| padding-left: 1em; |
| padding-right: 1em; |
| width: 55em; |
| border: 1px dotted blue; |
| background-color: #efefef; |
| white-space: nowrap; |
| } |
| .colfamily .header { |
| font-weight: bold; |
| padding-right: 1em; |
| } |
| .colfamily .var { |
| font-style: italic; |
| font-family: monospace; |
| } |
| .colfamily .lit { |
| font-family: monospace; |
| } |
| .colfamily .example { |
| font-family: monospace; |
| } |
| .colfamily p { |
| white-space: normal; |
| } |
| |
| .summary_table { |
| border-collapse: collapse; |
| border-spacing: 0; |
| } |
| .summary_table .desc { |
| font-size: 8pt; |
| white-space: nowrap; |
| text-align: right; |
| width: 20em; |
| } |
| .summary_table td { |
| border: 1px dotted lightgray; |
| padding-top: 2px; |
| padding-bottom: 2px; |
| padding-left: 5px; |
| padding-right: 5px; |
| vertical-align: top; |
| } |
| .summary_table tr.no_border td { |
| border: none; |
| } |
| </style> |
| </head> |
| <body> |
| |
| <h1>Git on DHT Schema</h1> |
| |
| <p>Storing Git repositories on a Distributed Hash Table (DHT) may |
| improve scaling for large traffic, but also simplifies management when |
| there are many small repositories.</p> |
| |
| <h2>Table of Contents</h2> |
| <ul> |
| <li><a href="#concepts">Concepts</a></li> |
| <li><a href="#summary">Summary</a></li> |
| <li><a href="#security">Data Security</a></li> |
| |
| <li>Tables: |
| <ul> |
| <li><a href="#REPOSITORY_INDEX">Table REPOSITORY_INDEX</a> |
| ( |
| <a href="#REPOSITORY_INDEX.id" class="toccol">id</a> |
| )</li> |
| |
| <li><a href="#REPOSITORY">Table REPOSITORY</a> |
| ( |
| <a href="#REPOSITORY.chunk-info" class="toccol">chunk-info</a>, |
| <a href="#REPOSITORY.cached-pack" class="toccol">cached-pack</a> |
| )</li> |
| |
| <li><a href="#REF">Table REF</a> |
| ( |
| <a href="#REF.target" class="toccol">target</a> |
| )</li> |
| |
| <li><a href="#OBJECT_INDEX">Table OBJECT_INDEX</a> |
| ( |
| <a href="#OBJECT_INDEX.info" class="toccol">info</a> |
| )</li> |
| |
| <li><a href="#CHUNK">Table CHUNK</a> |
| ( |
| <a href="#CHUNK.chunk" class="toccol">chunk</a>, |
| <a href="#CHUNK.index" class="toccol">index</a>, |
| <a href="#CHUNK.meta" class="toccol">meta</a> |
| )</li> |
| </ul> |
| </li> |
| |
| <li>Protocol Messages: |
| <ul> |
| <li><a href="#message_RefData">RefData</a></li> |
| <li><a href="#message_ObjectInfo">ObjectInfo</a></li> |
| <li><a href="#message_ChunkInfo">ChunkInfo</a></li> |
| <li><a href="#message_ChunkMeta">ChunkMeta</a></li> |
| <li><a href="#message_CachedPackInfo">CachedPackInfo</a></li> |
| </ul> |
| </li> |
| </ul> |
| |
| <a name="concepts"><h2>Concepts</h2></a> |
| |
| <p><i>Git Repository</i>: Stores the version control history for a |
| single project. Each repository is a directed acyclic graph (DAG) |
| composed of objects. Revision history for a project is described by a |
| commit object pointing to the complete set of files that make up that |
| version of the project, and a pointer to the commit that came |
| immediately before it. Repositories also contain references, |
| associating a human readable branch or tag name to a specific commit |
| object. Tommi Virtanen has a |
| <a href="http://eagain.net/articles/git-for-computer-scientists/">more |
| detailed description of the Git DAG</a>.</p> |
| |
| <p><i>Object</i>: Git objects are named by applying the SHA-1 hash |
| algorithm to their contents. There are 4 object types: commit, tree, |
| blob, tag. Objects are typically stored deflated using libz deflate, |
| but may also be delta compressed against another similar object, |
| further reducing the storage required. The big factor for Git |
| repository size is usually object count, e.g. the linux-2.6 repository |
| contains 1.8 million objects.</p> |
| |
| <p><i>Reference</i>: Associates a human readable symbolic name, such |
| as <code>refs/heads/master</code> to a specific Git object, usually a |
| commit or tag. References are updated to point to the most recent |
| object whenever changes are committed to the repository.</p> |
| |
| <p><i>Git Pack File</i>: A container stream holding many objects in a |
| highly compressed format. On the local filesystem, Git uses pack files |
| to reduce both inode and space usage by combining millions of objects |
| into a single data stream. On the network, Git uses pack files as the |
| basic network protocol to transport objects from one system's |
| repository to another.</p> |
| |
| <p><i>Garbage Collection</i>: Scanning the Git object graph to locate |
| objects that are reachable, and others that are unreachable. Git also |
| generally performs data recompression during this task to produce more |
| optimal deltas between objects, reducing overall disk usage and data |
| transfer sizes. This is independent of any GC that may be performed by |
| the DHT to clean up old cells.</p> |
| |
| <p>The basic storage strategy employed by this schema is to break a |
| standard Git pack file into chunks, approximately 1 MiB in size. Each |
| chunk is stored as one row in the <a href="#CHUNK">CHUNK</a> table. |
| During reading, chunks are paged into the application on demand, but |
| may also be prefetched using prefetch hints. Rules are used to break |
| the standard pack into chunks, these rules help to improve reference |
| locality and reduce the number of chunk loads required to service |
| common operations. In a nutshell, the DHT is used as a virtual memory |
| system for pages about 1 MiB in size.</p> |
| |
| <a name="summary"><h2>Summary</h2></a> |
| |
| <p>The schema consists of a handful of tables. Size estimates are |
| given for one copy of the linux-2.6 Git repository, a relative tortue |
| test case that contains 1.8 million objects and is 425 MiB when stored |
| on the local filesystem. All sizes are before any replication made by |
| the DHT, or its underlying storage system.</p> |
| |
| <table style='margin-left: 2em' class='summary_table'> |
| <tr> |
| <th>Table</th> |
| <th>Rows</th> |
| <th>Cells/Row</th> |
| <th>Bytes</th> |
| <th>Bytes/Row</th> |
| </tr> |
| |
| <tr> |
| <td><a href="#REPOSITORY_INDEX">REPOSITORY_INDEX</a> |
| <div class='desc'>Map host+path to surrogate key.</div></td> |
| <td align='right'>1</td> |
| <td align='right'>1</td> |
| <td align='right'>< 100 bytes</td> |
| <td align='right'>< 100 bytes</td> |
| </tr> |
| |
| <tr> |
| <td><a href="#REPOSITORY">REPOSITORY</a> |
| <div class='desc'>Accounting and replica management.</div></td> |
| <td align='right'>1</td> |
| <td align='right'>403</td> |
| <td align='right'>65 KiB</td> |
| <td align='right'>65 KiB</td> |
| </tr> |
| |
| <tr> |
| <td><a href="#REF">REF</a> |
| <div class='desc'>Bind branch/tag name to Git object.</div></td> |
| <td align='right'>211</td> |
| <td align='right'>1</td> |
| <td align='right'>14 KiB</td> |
| <td align='right'>67 bytes</td> |
| </tr> |
| |
| <tr> |
| <td><a href="#OBJECT_INDEX">OBJECT_INDEX</a> |
| <div class='desc'>Locate Git object by SHA-1 name.</div></td> |
| <td align='right'>1,861,833</td> |
| <td align='right'>1</td> |
| <td align='right'>154 MiB</td> |
| <td align='right'>87 bytes</td> |
| </tr> |
| |
| <tr> |
| <td><a href="#CHUNK">CHUNK</a> |
| <div class='desc'>Complete Git object storage.</div></td> |
| <td align='right'>402</td> |
| <td align='right'>3</td> |
| <td align='right'>417 MiB</td> |
| <td align='right'>~ 1 MiB</td> |
| </tr> |
| |
| <tr class='no_border'> |
| <td align='right'><i>Total</i></td> |
| <td align='right'>1,862,448</td> |
| <td align='right'></td> |
| <td align='right'>571 MiB</td> |
| <td align='right'></td> |
| </tr> |
| </table> |
| |
| <a name="security"><h2>Data Security</h2></a> |
| |
| <p>If data encryption is necessary to protect file contents, the <a |
| href="#CHUNK.chunk">CHUNK.chunk</a> column can be encrypted with a |
| block cipher such as AES. This column contains the revision commit |
| messages, file paths, and file contents. By encrypting one column, the |
| majority of the repository data is secured. As each cell value is |
| about 1 MiB and contains a trailing 4 bytes of random data, an ECB |
| mode of operation may be sufficient. Because the cells are already |
| very highly compressed using the Git data compression algorithms, |
| there is no increase in disk usage due to encryption.</p> |
| |
| <p>Branch and tag names (<a href="#REF">REF</a> row keys) are not |
| encrypted. If these need to be secured the portion after the ':' would |
| need to be encrypted with a block cipher. However these strings are |
| very short and very common (HEAD, refs/heads/master, refs/tags/v1.0), |
| making encryption difficult. A variation on the schema might move all |
| rows for a repository into a single protocol messsage, then encrypt |
| the protobuf into a single cell. Unfortunately this strategy has a |
| high update cost, and references change frequently.</p> |
| |
| <p>Object SHA-1 names (<a href="#OBJECT_INDEX">OBJECT_INDEX</a> row |
| keys and <a href="#CHUNK.index">CHUNK.index</a> values) are not |
| encrypted. This allows a reader to determine if a repository contains |
| a specific revision, but does not allow them to inspect the contents |
| of the revision. The CHUNK.index column could also be encrypted with a |
| block cipher when CHUNK.chunk is encrypted (see above), however the |
| OBJECT_INDEX table row keys cannot be encrypted if abbrevation |
| expansion is to be supported for end-users of the repository. The row |
| keys must be unencrypted as abbreviation resolution is performed by a |
| prefix range scan over the keys.</p> |
| |
| <p>The remaining tables and columns contain only statistics (e.g. |
| object counts or cell sizes), or internal surrogate keys |
| (repository_id, chunk_key) and do not require encryption.</p> |
| |
| <hr /> |
| <a name="REPOSITORY_INDEX"><h2>Table REPOSITORY_INDEX</h2></a> |
| |
| <p>Maps a repository name, as presented in the URL by an end-user or |
| client application, into its internal repository_id value. This |
| mapping allows the repository name to be quickly modified (e.g. |
| renamed) without needing to update the larger data rows of the |
| repository.</p> |
| |
| <p>The exact semantics of the repository_name format is left as a |
| deployment decision, but DNS hostname, '/', repository name would be |
| one common usage.</p> |
| |
| <h3>Row Key</h3> |
| <div class='rowkey'> |
| <div> |
| <span class='header'>Row Key:</span> |
| <span class='var'>repository_name</span> |
| </div> |
| |
| <p>Human readable name of the repository, typically derived from the |
| HTTP <code>Host</code> header and path in the URL.</p> |
| |
| <p>Examples:</p> |
| <ul> |
| <li><span class='example'>com.example.git/pub/git/foo.git</span></li> |
| <li><span class='example'>com.example.git/~user/mystuff.git</span></li> |
| </ul> |
| </div> |
| |
| <h3>Columns</h3> |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="REPOSITORY_INDEX.id"><span class='lit'>id:</span></a> |
| </div> |
| |
| <p>The repository_id, as an 8-digit hex ASCII string.</p> |
| </div> |
| |
| <h3>Size Estimate</h3> |
| |
| <p>Less than 100,000 rows. More likely estimate is 1,000 rows. |
| Total disk usage under 512 KiB, assuming 1,000 names and 256 |
| characters per name.</p> |
| |
| <h3>Updates</h3> |
| |
| <p>Only on repository creation or rename, which is infrequent (<10 |
| rows/month). Updates are performed in a row-level transaction, to |
| ensure a name is either assigned uniquely, or fails.</p> |
| |
| <h3>Reads</h3> |
| |
| <p>Reads are tried first against memcache, then against the DHT if the |
| entry did not exist in memcache. Successful reads against the DHT are |
| put back into memcache in the background.</p> |
| |
| <a name="REPOSITORY"><h2>Table REPOSITORY</h2></a> |
| |
| <p>Tracks top-level information about each repository.</p> |
| |
| <h3>Row Key</h3> |
| <div class='rowkey'> |
| <div> |
| <span class='header'>Row Key:</span> |
| <span class='var'>repository_id</span> |
| </div> |
| |
| <p>The repository_id, as an 8-digit hex ASCII string.</p> |
| </div> |
| |
| <p>Typically this is assigned sequentially, then has the bits reversed |
| to evenly spread repositories throughout the DHT. For example the |
| first repository is <code>80000000</code>, and the second is |
| <code>40000000</code>.</p> |
| |
| <h3>Columns</h3> |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="REPOSITORY.chunk-info"><span class='lit'>chunk-info:</span></a> |
| <span class='var'>chunk_key[short]</span> |
| </div> |
| |
| <p>Cell value is the protocol message <a |
| href="#message_ChunkInfo">ChunkInfo</a> describing the chunk's |
| contents. Most of the message's fields are only useful for quota |
| accounting and reporting.</p> |
| </div> |
| |
| <p>This column exists to track all of the chunks that make up a |
| repository's object set. Garbage collection and quota accounting tasks |
| can primarily drive off this column, rather than scanning the much |
| larger <a href="#CHUNK">CHUNK</a> table with a regular expression on |
| the chunk row key.</p> |
| |
| <p>As each chunk averages 1 MiB in size, the linux-2.6 repository |
| (at 373 MiB) has about 400 chunks and thus about 400 chunk-info |
| cells. The chromium repository (at 1 GiB) has about 1000 chunk-info |
| cells. It would not be uncommon to have 2000 chunk-info cells.</p> |
| |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="REPOSITORY.cached-pack"><span class='lit'>cached-pack:</span></a> |
| <span class='var'>NNNNx38</span> |
| <span class='lit'>.</span> |
| <span class='var'>VVVVx38</span> |
| </div> |
| |
| <p>Variables:</p> |
| <ul> |
| <li><span class='var'>NNNNx38</span> = 40 hex digit name of the cached pack</li> |
| <li><span class='var'>VVVVx38</span> = 40 hex digit version of the cached pack</li> |
| </ul> |
| |
| <p>Examples:</p> |
| <ul> |
| <li><span class='example'>4e32fb97103981e7dd53dcc786640fa4fdb444b8.8975104a03d22e54f7060502e687599d1a2c2516</span></li> |
| </ul> |
| |
| <p>Cell value is the protocol message <a |
| href="#message_CachedPackInfo">CachedPackInfo</a> describing the |
| chunks that make up a cached pack.</p> |
| </div> |
| |
| <p>The <code>cached-pack</code> column family describes large lists of |
| chunks that when combined together in a specific order create a valid |
| Git pack file directly streamable to a client. This avoids needing to |
| enumerate and pack the entire repository on each request.</p> |
| |
| <p>The cached-pack name (NNNNx38 above) is the SHA-1 of the objects |
| contained within the pack, in binary, sorted. This is the standard |
| naming convention for pack files on the local filesystem. The version |
| (VVVVx38 above) is the SHA-1 of the chunk keys, sorted. The version |
| makes the cached-pack cell unique, if any single bit in the compressed |
| data is modified a different version will be generated, and a |
| different cell will be used to describe the alternate version of the |
| same data. The version is necessary to prevent repacks of the same |
| object set (but with different compression settings or results) from |
| stepping on active readers.</p> |
| |
| <h2>Size Estimate</h2> |
| |
| <p>1 row per repository (~1,000 repositories), however the majority of |
| the storage cost is in the <code>chunk-info</code> column family, |
| which can have more than 2000 cells per repository.</p> |
| |
| <p>Each <code>chunk-info</code> cell is on average 147 bytes. For a |
| large repository like chromium.git (over 1000 chunks) this is only 147 |
| KiB for the entire row.</p> |
| |
| <p>Each <code>cached-pack</code> cell is on average 5350 bytes. Most |
| repositories have 1 of these cells, 2 while the repository is being |
| repacked on the server side to update the cached-pack data.</p> |
| |
| <h2>Updates</h2> |
| |
| <p>Information about each ~1 MiB chunk of pack data received over the |
| network is stored as a unique column in the <code>chunk-info</code> |
| column family.</p> |
| |
| <p>Most pushes are at least 2 chunks (commit, tree), with 50 pushes |
| per repository per day being possible (50,000 new cells/day).</p> |
| |
| <p><b>TODO:</b> Average push rates?</p> |
| |
| <h2>Reads</h2> |
| |
| <p><i>Serving clients:</i> Read all cells in the |
| <code>cached-pack</code> column family, typically only 1-5 cells. The |
| cells are cached in memcache and read from there first.</p> |
| |
| <p><i>Garbage collection:</i> Read all cells in the |
| <code>chunk-info</code> column family to determine which chunks are |
| owned by this repository, without scanning the <a href="#CHUNK">CHUNK</a> table. |
| Delete <code>chunk-info</code> after the corresponding <a href="#CHUNK">CHUNK</a> |
| row has been deleted. Unchanged chunks have their info left alone.</p> |
| |
| <a name="REF"><h2>Table REF</h2></a> |
| |
| <p>Associates a human readable branch (e.g. |
| <code>refs/heads/master</code>) or tag (e.g. |
| <code>refs/tags/v1.0</code>) name to the Git |
| object that represents that current state of |
| the repository.</p> |
| |
| <h3>Row Key</h3> |
| <div class='rowkey'> |
| <div> |
| <span class='header'>Row Key:</span> |
| <span class='var'>repository_id</span> |
| <span class='lit'>:</span> |
| <span class='var'>ref_name</span> |
| </div> |
| |
| <p>Variables:</p> |
| <ul> |
| <li><span class='var'>repository_id</span> = Repository owning the reference (see above)</li> |
| <li><span class='var'>ref_name</span> = Name of the reference, UTF-8 string</li> |
| </ul> |
| |
| <p>Examples:</p> |
| <ul> |
| <li><span class='example'>80000000:HEAD</span></li> |
| <li><span class='example'>80000000:refs/heads/master</span></li> |
| <br /> |
| <li><span class='example'>40000000:HEAD</span></li> |
| <li><span class='example'>40000000:refs/heads/master</span></li> |
| </ul> |
| </div> |
| |
| <p>The separator <code>:</code> used in the row key was chosen because |
| this character is not permitted in a Git reference name.</p> |
| |
| <h3>Columns</h3> |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="REF.target"><span class='lit'>target:</span></a> |
| </div> |
| |
| <p>Cell value is the protocol message |
| <a href="#message_RefData">RefData</a> describing the |
| current SHA-1 the reference points to, and the chunk |
| it was last observed in. The chunk hint allows skipping |
| a read of <a href="#OBJECT_INDEX">OBJECT_INDEX</a>.</p> |
| |
| <p>Several versions (5) are stored for emergency rollbacks. |
| Additional versions beyond 5 are cleaned up during table garbage |
| collection as managed by the DHT's cell GC.</p> |
| </div> |
| |
| <h3>Size Estimate</h3> |
| |
| <p><i>Normal Git usage:</i> ~10 branches per repository, ~200 tags. |
| For 1,000 repositories, about 200,000 rows total. Average row size is |
| about 240 bytes/row before compression (67 after), or 48M total.</p> |
| |
| <p><i>Gerrit Code Review usage:</i> More than 300 new rows per day. |
| Each snapshot of each change under review is one reference.</p> |
| |
| <h3>Updates</h3> |
| |
| <p>Writes are performed by doing an atomic compare-and-swap (through a |
| transaction), changing the RefData protocol buffer.</p> |
| |
| <h3>Reads</h3> |
| |
| <p>Reads perform prefix scan for all rows starting with |
| <code>repository_id:</code>. Plans exist to cache these reads within a |
| custom service, avoiding most DHT queries.</p> |
| |
| <a name="OBJECT_INDEX"><h2>Table OBJECT_INDEX</h2></a> |
| |
| <p>The Git network protocol has clients sending object SHA-1s to the |
| server, with no additional context or information. End-users may also |
| type a SHA-1 into a web search box. This table provides a mapping of |
| the object SHA-1 to which chunk(s) store the object's data. The table |
| is sometimes also called the 'global index', since it names where |
| every single object is stored.</p> |
| |
| <h3>Row Key</h3> |
| <div class='rowkey'> |
| <div> |
| <span class='header'>Row Key:</span> |
| <span class='var'>NN</span> |
| <span class='lit'>.</span> |
| <span class='var'>repository_id</span> |
| <span class='lit'>.</span> |
| <span class='var'>NNx40</span> |
| </div> |
| |
| <p>Variables:</p> |
| <ul> |
| <li><span class='var'>NN</span> = First 2 hex digits of object SHA-1</li> |
| <li><span class='var'>repository_id</span> = Repository owning the object (see above)</li> |
| <li><span class='var'>NNx40</span> = Complete object SHA-1 name, in hex</li> |
| </ul> |
| |
| <p>Examples:</p> |
| <ul> |
| <li><span class='example'>2b.80000000.2b5c9037c81c38b3b9abc29a3a87a4abcd665ed4</span></li> |
| <li><span class='example'>8f.40000000.8f270a441569b127cc4af8a6ef601d94d9490efb</span></li> |
| </ul> |
| </div> |
| |
| <p>The first 2 hex digits (<code>NN</code>) distribute object keys |
| within the same repository around the DHT keyspace, preventing a busy |
| repository from creating too much of a hot-spot within the DHT. To |
| simplify key generation, these 2 digits are repeated after the |
| repository_id, as part of the 40 hex digit object name.</p> |
| |
| <p>Keys must be clustered by repository_id to support extending |
| abbreviations. End-users may supply an abbreviated SHA-1 of 4 or more |
| digits (up to 39) and ask the server to complete them to a full 40 |
| digit SHA-1 if the server has the relevant object within the |
| repository's object set.</p> |
| |
| <p>A schema variant that did not include the repository_id as part of |
| the row key was considered, but discarded because completing a short |
| 4-6 digit abbreviated SHA-1 would be impractical once there were |
| billions of objects stored in the DHT. Git end-users expect to be able |
| to use 4 or 6 digit abbreviations on very small repositories, as the |
| number of objects is low and thus the number of bits required to |
| uniquely name an object within that object set is small.</p> |
| |
| <h3>Columns</h3> |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="OBJECT_INDEX.info"><span class='lit'>info:</span></a> |
| <span class='var'>chunk_key[short]</span> |
| </div> |
| |
| <p>Cell value is the protocol message |
| <a href="#message_ObjectInfo">ObjectInfo</a> describing how the object |
| named by the row key is stored in the chunk named by the column name.</p> |
| |
| <p>Cell timestamp matters. The <b>oldest cell</b> within the |
| entire column family is favored during reads. As chunk_key is |
| unique, versions within a single column aren't relevant.</p> |
| </div> |
| |
| <h3>Size Estimate</h3> |
| |
| <p>Average row size per object/chunk pair is 144 bytes uncompressed |
| (87 compressed), based on the linux-2.6 repository. The linux-2.6 |
| repository has 1.8 million objects, and is growing at a rate of about |
| 300,000 objects/year. Total usage for linux-2.6 is above 154M.</p> |
| |
| <p>Most rows contain only 1 cell, as the object appears in only 1 |
| chunk within that repository.</p> |
| |
| <p><i>Worst case:</i> 1.8 million rows/repository * 1,000 repositories |
| is around 1.8 billion rows and 182G.</p> |
| |
| <h3>Updates</h3> |
| |
| <p>One write per object received over the network; typically performed |
| as part of an asynchronous batch. Each batch is sized around 512 KiB |
| (about 3000 rows). Because of SHA-1's uniform distribution, row keys |
| are first sorted and then batched into buckets of about 3000 rows. To |
| prevent too much activity from going to one table segment at a time |
| the complete object list is segmented into up to 32 groups which are |
| written in round-robin order.</p> |
| |
| <p>A full push of the linux-2.6 repository writes 1.8 million |
| rows as there are 1.8 million objects in the pack stream.</p> |
| |
| <p>During normal insert or receive operations, each received object is |
| a blind write to add one new <code>info:chunk_key[short]</code> cell |
| to the row. During repack, all cells in the <code>info</code> column |
| family are replaced with a single cell.</p> |
| |
| <h3>Reads</h3> |
| |
| <p>During common ancestor negotiation reads occur in batches of 64-128 |
| full row keys, uniformly distributed throughout the key space. Most of |
| these reads are misses, the OBJECT_INDEX table does not contain the |
| key offered by the client. A successful negotation for most developers |
| requires at least two rounds of 64 objects back-to-back over HTTP. Due |
| to the high miss rate on this table, an in-memory bloom filter may be |
| important for performance.</p> |
| |
| <p>To support the high read-rate (and high miss-rate) during common |
| ancestor negotation, an alternative to an in-memory bloom filter |
| within the DHT is to downoad the entire set of keys into an alternate |
| service job for recently accessed repositories. This service can only |
| be used if <i>all</i> of the keys for the same repository_id are |
| hosted within the service. Given this is under 36 GiB for the worst |
| case 1.8 billion rows mentioned above, this may be feasible. Loading |
| the table can be performed by fetching <a |
| href="#REPOSITORY.chunk-info">REPOSITORY.chunk-info</a> and then |
| performing parallel gets for the <a |
| href="#CHUNK.index">CHUNK.index</a> column, and scanning the local |
| indexes to construct the list of known objects.</p> |
| |
| <p>During repacking with no delta reuse, worst case scenario requires |
| reading all records with the same repository_id (for linux-2.6 this |
| is 1.8 million rows). Reads are made in a configurable batch size, |
| right now this is set at 2048 keys/batch, with 4 concurrent batches in |
| flight at a time.</p> |
| |
| <p>Reads are tried first against memcache, then against the DHT if the |
| entry did not exist in memcache. Successful reads against the DHT are |
| put back into memcache in the background.</p> |
| |
| <a name="CHUNK"><h2>Table CHUNK</h2></a> |
| |
| <p>Stores the object data for a repository, containing commit history, |
| directory structure, and file revisions. Each chunk is typically 1 MiB |
| in size, excluding the index and meta columns.</p> |
| |
| <h3>Row Key</h3> |
| <div class='rowkey'> |
| <div> |
| <span class='header'>Row Key:</span> |
| <span class='var'>HH</span> |
| <span class='lit'>.</span> |
| <span class='var'>repository_id</span> |
| <span class='lit'>.</span> |
| <span class='var'>HHx40</span> |
| </div> |
| |
| <p>Variables:</p> |
| <ul> |
| <li><span class='var'>HH</span> = First 2 hex digits of chunk SHA-1</li> |
| <li><span class='var'>repository_id</span> = Repository owning the chunk (see above)</li> |
| <li><span class='var'>HHx40</span> = Complete chunk SHA-1, in hex</li> |
| </ul> |
| |
| <p>Examples:</p> |
| <ul> |
| <li><span class='example'>09.80000000.09e0eb57543be633b004b672cbebdf335aa4d53f</span> <i>(full key)</i></li> |
| </ul> |
| </div> |
| |
| <p>Chunk keys are computed by first computing the SHA-1 of the |
| <code>chunk:</code> column, which is the compressed object contents |
| stored within the chunk. As the chunk data includes a 32 bit salt in |
| the trailing 4 bytes, this value is random even for the exact same |
| object input.</p> |
| |
| <p>The leading 2 hex digit <code>HH</code> component distributes |
| chunks for the same repository (and over the same time period) evenly |
| around the DHT keyspace, preventing any portion from becoming too |
| hot.</p> |
| |
| <h3>Columns</h3> |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="CHUNK.chunk"><span class='lit'>chunk:</span></a> |
| </div> |
| |
| <p>Multiple objects in Git pack-file format, about 1 MiB in size. |
| The data is already very highly compressed by Git and is not further |
| compressable by the DHT.</p> |
| </div> |
| |
| <p>This column is essentially the standard Git pack-file format, |
| without the standard header or trailer. Objects can be stored in |
| either whole format (object content is simply deflated inline) |
| or in delta format (reference to a delta base is followed by |
| deflated sequence of copy and/or insert instructions to recreate |
| the object content). The OBJ_OFS_DELTA format is preferred |
| for deltas, since it tends to use a shorter encoding than the |
| OBJ_REF_DELTA format. Offsets beyond the start of the chunk are |
| actually offsets to other chunks, and must be resolved using the |
| <code>meta.base_chunk.relative_start</code> field.</p> |
| |
| <p>Because the row key is derived from the SHA-1 of this column, the |
| trailing 4 bytes is randomly generated at insertion time, to make it |
| impractical for remote clients to predict the name of the chunk row. |
| This allows the stream parser to bindly insert rows without first |
| checking for row existance, or worry about replacing an existing |
| row and causing data corruption.</p> |
| |
| <p>This column value is essentially opaque to the DHT.</p> |
| |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="CHUNK.index"><span class='lit'>index:</span></a> |
| </div> |
| |
| <p>Binary searchable table listing object SHA-1 and starting offset |
| of that object within the <code>chunk:</code> data stream. The data |
| in this index is essentially random (due to the SHA-1s stored in |
| binary) and thus is not compressable.</p> |
| </div> |
| |
| <p>Sorted list of SHA-1s of each object that is stored in this chunk, |
| along with the offset. This column allows efficient random access to |
| any object within the chunk, without needing to perform a remote read |
| against <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table. The column is |
| very useful at read time, where pointers within Git objects will |
| frequently reference other objects stored in the same chunk.</p> |
| |
| <p>This column is sometimes called the local index, since it is local |
| only to the chunk and thus differs from the global index stored in the |
| <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table.</p> |
| |
| <p>The column size is 24 bytes per object stored in the chunk. Commit |
| chunks store on average 2200 commits/chunk, so a commit chunk index is |
| about 52,800 bytes.</p> |
| |
| <p>This column value is essentially opaque to the DHT.</p> |
| |
| <div class='colfamily'> |
| <div> |
| <span class='header'>Column:</span> |
| <a name="CHUNK.meta"><span class='lit'>meta:</span></a> |
| </div> |
| |
| <p>Cell value is the protocol message |
| <a href="#message_ChunkMeta">ChunkMeta</a> describing prefetch |
| hints, object fragmentation, and delta base locations. Unlike |
| <code>chunk:</code> and <code>index:</code>, this column is |
| somewhat compressable.</p> |
| </div> |
| |
| <p>The meta column provides information critical for reading the |
| chunk's data. (Unlike <a href="#message_ChunkInfo">ChunkInfo</a> in |
| the <a href="#REPOSITORY">REPOSITORY</a> table, which is used only for |
| accounting.)</p> |
| |
| <p>The most important element is the BaseChunk nested message, |
| describing a chunk that contains a base object required to inflate |
| an object that is stored in this chunk as a delta.</p> |
| |
| <h3>Chunk Contents</h3> |
| |
| <p>Chunks try to store only a single object type, however mixed object |
| type chunks are supported. The rule to store only one object type per |
| chunk improves data locality, reducing the number of chunks that need |
| to be accessed from the DHT in order to perform a particular Git |
| operation. Clustering commits together into a 'commit chunk' improves |
| data locality during log/history walking operations, while clustering |
| trees together into a 'tree chunk' improves data locality during the |
| early stages of packing or difference generation.</p> |
| |
| <p>Chunks reuse the standard Git pack data format to support direct |
| streaming of a chunk's <code>chunk:</code> column to clients, without |
| needing to perform any data manipulation on the server. This enables |
| high speed data transfer from the DHT to the client.</p> |
| |
| <h3>Large Object Fragmentation</h3> |
| |
| <p>If a chunk contains more than one object, all objects within the |
| chunk must store their complete compressed form within the chunk. This |
| limits an object to less than 1 MiB of compressed data.</p> |
| |
| <p>Larger objects whose compressed size is bigger than 1 MiB are |
| fragmented into multiple chunks. The first chunk contains the object's |
| pack header, and the first 1 MiB of compressed data. Subsequent data |
| is stored in additional chunks. The additional chunk keys are stored |
| in the <code>meta.fragment</code> field. Each chunk that is part of |
| the same large object redundantly stores the exact same meta |
| value.</p> |
| |
| <h3>Size Estimate</h3> |
| |
| <p>Approximately the same size if the repository was stored on the |
| local filesystem. For the linux-2.6 repository (373M / 1.8 million |
| objects), about 417M (373.75M in <code>chunk:</code>, 42.64M in |
| <code>index:</code>, 656K in <code>meta:</code>).</p> |
| |
| <p>Row count is close to size / 1M (373M / 1M = 373 rows), but may be |
| slightly higher (e.g. 402) due to fractional chunks on the end of |
| large fragmented objects, or where the single object type rule caused a |
| chunk to close before it was full.</p> |
| |
| <p>For the complete Android repository set, disk usage is ~13G.</p> |
| |
| <h3>Updates</h3> |
| |
| <p>This table is (mostly) append-only. Write operations blast in ~1 |
| MiB chunks, as the key format assures writers the new row does not |
| already exist. Chunks are randomly scattered by the hashing function, |
| and are not buffered very deep by writers.</p> |
| |
| <p><i>Interactive writes:</i> Small operations impacting only 1-5 |
| chunks will write all columns in a single operation. Most chunks of |
| this varity will be very small, 1-10 objects per chunk and about 1-10 |
| KiB worth of compressed data inside of the <code>chunk:</code> column. |
| This class of write represents a single change made by one developer |
| that must be shared back out immediately.</p> |
| |
| <p><i>Large pushes:</i> Large operations impacting tens to hundreds of |
| chunks will first write the <code>chunk:</code> column, then come back |
| later and populate the <code>index:</code> and <code>meta:</code> |
| columns once all chunks have been written. The delayed writing of |
| index and meta during large operations is required because the |
| information for these columns is not available until the entire data |
| stream from the Git client has been received and scanned. As the Git |
| server may not have sufficient memory to store all chunk data (373M or |
| 1G!), its written out first to free up memory.</p> |
| |
| <p><i>Garbage collection:</i> Chunks that are not optimally sized |
| (less than the target ~1 MiB), optimally localized (too many graph |
| pointers outside of the chunk), or compressed (Git found a smaller way |
| to store the same content) will be replaced by first writing new |
| chunks, and then deleting the old chunks.</p> |
| |
| <p>Worst case, this could churn as many as 402 rows and 373M worth of |
| data for the linux-2.6 repository. Special consideration will be made |
| to try and avoid replacing chunks whose <code>WWWW</code> key |
| component is 'sufficiently old' and whose content is already |
| sufficiently sized and compressed. This will help to limit churn to |
| only more recently dated chunks, which are smaller in size.</p> |
| |
| <h3>Reads</h3> |
| |
| <p>All columns are read together as a unit. Memcache is checked first, |
| with reads falling back to the DHT if the cache does not have the |
| chunk.</p> |
| |
| <p>Reasonably accurate prefetching is supported through background |
| threads and prefetching metadata embedded in the <a |
| href="#message_CachedPackInfo">CachedPackInfo</a> and <a |
| href="#message_ChunkMeta">ChunkMeta</a> protocol messages used by |
| readers.</p> |
| |
| <hr /> |
| <h2>Protocol Messages</h2> |
| |
| <pre> |
| package git_store; |
| option java_package = "org.eclipse.jgit.storage.dht.proto"; |
| |
| |
| // Entry in RefTable describing the target of the reference. |
| // Either symref *OR* target must be populated, but never both. |
| // |
| <a name="message_RefData">message RefData</a> { |
| // An ObjectId with an optional hint about where it can be found. |
| // |
| message Id { |
| required string object_name = 1; |
| optional string chunk_key = 2; |
| } |
| |
| // Name of another reference this reference inherits its target |
| // from. The target is inherited on-the-fly at runtime by reading |
| // the other reference. Typically only "HEAD" uses symref. |
| // |
| optional string symref = 1; |
| |
| // ObjectId this reference currently points at. |
| // |
| optional Id target = 2; |
| |
| // True if the correct value for peeled is stored. |
| // |
| optional bool is_peeled = 3; |
| |
| // If is_peeled is true, this field is accurate. This field |
| // exists only if target points to annotated tag object, then |
| // this field stores the "object" field for that tag. |
| // |
| optional Id peeled = 4; |
| } |
| |
| |
| // Entry in ObjectIndexTable, describes how an object appears in a chunk. |
| // |
| <a name="message_ObjectInfo">message ObjectInfo</a> { |
| // Type of Git object. |
| // |
| enum ObjectType { |
| COMMIT = 1; |
| TREE = 2; |
| BLOB = 3; |
| TAG = 4; |
| } |
| optional ObjectType object_type = 1; |
| |
| // Position of the object's header within its chunk. |
| // |
| required int32 offset = 2; |
| |
| // Total number of compressed data bytes, not including the pack |
| // header. For fragmented objects this is the sum of all chunks. |
| // |
| required int64 packed_size = 3; |
| |
| // Total number of bytes of the uncompressed object. For a |
| // delta this is the size after applying the delta onto its base. |
| // |
| required int64 inflated_size = 4; |
| |
| // ObjectId of the delta base, if this object is stored as a delta. |
| // The base is stored in raw binary. |
| // |
| optional bytes delta_base = 5; |
| } |
| |
| |
| // Describes at a high-level the information about a chunk. |
| // A repository can use this summary to determine how much |
| // data is stored, or when garbage collection should occur. |
| // |
| <a name="message_ChunkInfo">message ChunkInfo</a> { |
| // Source of the chunk (what code path created it). |
| // |
| enum Source { |
| RECEIVE = 1; // Came in over the network from external source. |
| INSERT = 2; // Created in this repository (e.g. a merge). |
| REPACK = 3; // Generated during a repack of this repository. |
| } |
| optional Source source = 1; |
| |
| // Type of Git object stored in this chunk. |
| // |
| enum ObjectType { |
| MIXED = 0; |
| COMMIT = 1; |
| TREE = 2; |
| BLOB = 3; |
| TAG = 4; |
| } |
| optional ObjectType object_type = 2; |
| |
| // True if this chunk is a member of a fragmented object. |
| // |
| optional bool is_fragment = 3; |
| |
| // If present, key of the CachedPackInfo object |
| // that this chunk is a member of. |
| // |
| optional string cached_pack_key = 4; |
| |
| // Summary description of the objects stored here. |
| // |
| message ObjectCounts { |
| // Number of objects stored in this chunk. |
| // |
| optional int32 total = 1; |
| |
| // Number of objects stored in whole (non-delta) form. |
| // |
| optional int32 whole = 2; |
| |
| // Number of objects stored in OFS_DELTA format. |
| // The delta base appears in the same chunk, or |
| // may appear in an earlier chunk through the |
| // ChunkMeta.base_chunk link. |
| // |
| optional int32 ofs_delta = 3; |
| |
| // Number of objects stored in REF_DELTA format. |
| // The delta base is at an unknown location. |
| // |
| optional int32 ref_delta = 4; |
| } |
| optional ObjectCounts object_counts = 5; |
| |
| // Size in bytes of the chunk's compressed data column. |
| // |
| optional int32 chunk_size = 6; |
| |
| // Size in bytes of the chunk's index. |
| // |
| optional int32 index_size = 7; |
| |
| // Size in bytes of the meta information. |
| // |
| optional int32 meta_size = 8; |
| } |
| |
| |
| // Describes meta information about a chunk, stored inline with it. |
| // |
| <a name="message_ChunkMeta">message ChunkMeta</a> { |
| // Enumerates the other chunks this chunk depends upon by OFS_DELTA. |
| // Entries are sorted by relative_start ascending, enabling search. Thus |
| // the earliest chunk is at the end of the list. |
| // |
| message BaseChunk { |
| // Bytes between start of the base chunk and start of this chunk. |
| // Although the value is positive, its a negative offset. |
| // |
| required int64 relative_start = 1; |
| required string chunk_key = 2; |
| } |
| repeated BaseChunk base_chunk = 1; |
| |
| // If this chunk is part of a fragment, key of every chunk that |
| // makes up the fragment, including this chunk. |
| // |
| repeated string fragment = 2; |
| |
| // Chunks that should be prefetched if reading the current chunk. |
| // |
| message PrefetchHint { |
| repeated string edge = 1; |
| repeated string sequential = 2; |
| } |
| optional PrefetchHint commit_prefetch = 51; |
| optional PrefetchHint tree_prefetch = 52; |
| } |
| |
| |
| // Describes a CachedPack, for efficient bulk clones. |
| // |
| <a name="message_CachedPackInfo">message CachedPackInfo</a> { |
| // Unique name of the cached pack. This is the SHA-1 hash of |
| // all of the objects that make up the cached pack, sorted and |
| // in binary form. (Same rules as Git on the filesystem.) |
| // |
| required string name = 1; |
| |
| // SHA-1 of all chunk keys, which are themselves SHA-1s of the |
| // raw chunk data. If any bit differs in compression (due to |
| // repacking) the version will differ. |
| // |
| required string version = 2; |
| |
| // Total number of objects in the cached pack. This must be known |
| // in order to set the final resulting pack header correctly before it |
| // is sent to clients. |
| // |
| required int64 objects_total = 3; |
| |
| // Number of objects stored as deltas, rather than deflated whole. |
| // |
| optional int64 objects_delta = 4; |
| |
| // Total size of the chunks, in bytes, not including the chunk footer. |
| // |
| optional int64 bytes_total = 5; |
| |
| // Objects this pack starts from. |
| // |
| message TipObjectList { |
| repeated string object_name = 1; |
| } |
| required TipObjectList tip_list = 6; |
| |
| // Chunks, in order of occurrence in the stream. |
| // |
| message ChunkList { |
| repeated string chunk_key = 1; |
| } |
| required ChunkList chunk_list = 7; |
| } |
| </pre> |
| |
| </body> |
| </html> |