org.eclipse.jgit.storage.dht/resources/org/eclipse/jgit/storage/dht/dht-schema.html - jgit - Git at Google

 <html>
 <head>
 <title>Git on DHT Schema</title>

 <style type='text/css'>
 body { font-size: 10pt; }
 h1   { font-size: 16pt; }
 h2   { font-size: 12pt; }
 h3   { font-size: 10pt; }

 body {
   margin-left: 8em;
   margin-right: 8em;
 }
 h1 { margin-left: -3em; }
 h2 { margin-left: -2em; }
 h3 { margin-left: -1em; }
 hr { margin-left: -4em; margin-right: -4em; }

 .coltoc {
   font-size: 8pt;
   font-family: monospace;
 }

 .rowkey {
   margin-left: 1em;
   padding-top: 0.2em;
   padding-left: 1em;
   padding-right: 1em;
   width: 54em;
   border: 1px dotted red;
   background-color: #efefef;
   white-space: nowrap;
 }
 .rowkey .header {
   font-weight: bold;
   padding-right: 1em;
 }
 .rowkey .var {
   font-style: italic;
   font-family: monospace;
 }
 .rowkey .lit {
   font-weight: bold;
   font-family: monospace;
 }
 .rowkey .example {
   font-family: monospace;
 }
 .rowkey p {
   white-space: normal;
 }

 .colfamily {
   margin-top: 0.5em;
   padding-top: 0.2em;
   padding-left: 1em;
   padding-right: 1em;
   width: 55em;
   border: 1px dotted blue;
   background-color: #efefef;
   white-space: nowrap;
 }
 .colfamily .header {
   font-weight: bold;
   padding-right: 1em;
 }
 .colfamily .var {
   font-style: italic;
   font-family: monospace;
 }
 .colfamily .lit {
   font-family: monospace;
 }
 .colfamily .example {
   font-family: monospace;
 }
 .colfamily p {
   white-space: normal;
 }

 .summary_table {
   border-collapse: collapse;
   border-spacing: 0;
 }
 .summary_table .desc {
   font-size: 8pt;
   white-space: nowrap;
   text-align: right;
   width: 20em;
 }
 .summary_table td {
   border: 1px dotted lightgray;
   padding-top: 2px;
   padding-bottom: 2px;
   padding-left: 5px;
   padding-right: 5px;
   vertical-align: top;
 }
 .summary_table tr.no_border td {
   border: none;
 }
 </style>
 </head>
 <body>

 <h1>Git on DHT Schema</h1>

 <p>Storing Git repositories on a Distributed Hash Table (DHT) may
 improve scaling for large traffic, but also simplifies management when
 there are many small repositories.</p>

 <h2>Table of Contents</h2>
 <ul>
   <li><a href="#concepts">Concepts</a></li>
   <li><a href="#summary">Summary</a></li>
   <li><a href="#security">Data Security</a></li>

   <li>Tables:
   <ul>
     <li><a href="#REPOSITORY_INDEX">Table REPOSITORY_INDEX</a>
       (
         <a href="#REPOSITORY_INDEX.id" class="toccol">id</a>
       )</li>

     <li><a href="#REPOSITORY">Table REPOSITORY</a>
       (
         <a href="#REPOSITORY.chunk-info" class="toccol">chunk-info</a>,
         <a href="#REPOSITORY.cached-pack" class="toccol">cached-pack</a>
       )</li>

     <li><a href="#REF">Table REF</a>
       (
         <a href="#REF.target" class="toccol">target</a>
       )</li>

     <li><a href="#OBJECT_INDEX">Table OBJECT_INDEX</a>
       (
         <a href="#OBJECT_INDEX.info" class="toccol">info</a>
       )</li>

     <li><a href="#CHUNK">Table CHUNK</a>
       (
         <a href="#CHUNK.chunk" class="toccol">chunk</a>,
         <a href="#CHUNK.index" class="toccol">index</a>,
         <a href="#CHUNK.meta" class="toccol">meta</a>
       )</li>
     </ul>
   </li>

   <li>Protocol Messages:
     <ul>
     <li><a href="#message_RefData">RefData</a></li>
     <li><a href="#message_ObjectInfo">ObjectInfo</a></li>
     <li><a href="#message_ChunkInfo">ChunkInfo</a></li>
     <li><a href="#message_ChunkMeta">ChunkMeta</a></li>
     <li><a href="#message_CachedPackInfo">CachedPackInfo</a></li>
     </ul>
   </li>
 </ul>

 <a name="concepts"><h2>Concepts</h2></a>

 <p><i>Git Repository</i>: Stores the version control history for a
 single project. Each repository is a directed acyclic graph (DAG)
 composed of objects. Revision history for a project is described by a
 commit object pointing to the complete set of files that make up that
 version of the project, and a pointer to the commit that came
 immediately before it. Repositories also contain references,
 associating a human readable branch or tag name to a specific commit
 object. Tommi Virtanen has a
 <a href="http://eagain.net/articles/git-for-computer-scientists/">more
 detailed description of the Git DAG</a>.</p>

 <p><i>Object</i>: Git objects are named by applying the SHA-1 hash
 algorithm to their contents. There are 4 object types: commit, tree,
 blob, tag. Objects are typically stored deflated using libz deflate,
 but may also be delta compressed against another similar object,
 further reducing the storage required. The big factor for Git
 repository size is usually object count, e.g. the linux-2.6 repository
 contains 1.8 million objects.</p>

 <p><i>Reference</i>: Associates a human readable symbolic name, such
 as <code>refs/heads/master</code> to a specific Git object, usually a
 commit or tag. References are updated to point to the most recent
 object whenever changes are committed to the repository.</p>

 <p><i>Git Pack File</i>: A container stream holding many objects in a
 highly compressed format. On the local filesystem, Git uses pack files
 to reduce both inode and space usage by combining millions of objects
 into a single data stream. On the network, Git uses pack files as the
 basic network protocol to transport objects from one system's
 repository to another.</p>

 <p><i>Garbage Collection</i>: Scanning the Git object graph to locate
 objects that are reachable, and others that are unreachable. Git also
 generally performs data recompression during this task to produce more
 optimal deltas between objects, reducing overall disk usage and data
 transfer sizes. This is independent of any GC that may be performed by
 the DHT to clean up old cells.</p>

 <p>The basic storage strategy employed by this schema is to break a
 standard Git pack file into chunks, approximately 1 MiB in size.  Each
 chunk is stored as one row in the <a href="#CHUNK">CHUNK</a> table.
 During reading, chunks are paged into the application on demand, but
 may also be prefetched using prefetch hints. Rules are used to break
 the standard pack into chunks, these rules help to improve reference
 locality and reduce the number of chunk loads required to service
 common operations. In a nutshell, the DHT is used as a virtual memory
 system for pages about 1 MiB in size.</p>

 <a name="summary"><h2>Summary</h2></a>

 <p>The schema consists of a handful of tables. Size estimates are
 given for one copy of the linux-2.6 Git repository, a relative tortue
 test case that contains 1.8 million objects and is 425 MiB when stored
 on the local filesystem. All sizes are before any replication made by
 the DHT, or its underlying storage system.</p>

 <table style='margin-left: 2em' class='summary_table'>
   <tr>
     <th>Table</th>
     <th>Rows</th>
     <th>Cells/Row</th>
     <th>Bytes</th>
     <th>Bytes/Row</th>
   </tr>

   <tr>
     <td><a href="#REPOSITORY_INDEX">REPOSITORY_INDEX</a>
     <div class='desc'>Map host+path to surrogate key.</div></td>
     <td align='right'>1</td>
     <td align='right'>1</td>
     <td align='right'>&lt; 100 bytes</td>
     <td align='right'>&lt; 100 bytes</td>
   </tr>

   <tr>
     <td><a href="#REPOSITORY">REPOSITORY</a>
     <div class='desc'>Accounting and replica management.</div></td>
     <td align='right'>1</td>
     <td align='right'>403</td>
     <td align='right'>65 KiB</td>
     <td align='right'>65 KiB</td>
   </tr>

   <tr>
     <td><a href="#REF">REF</a>
     <div class='desc'>Bind branch/tag name to Git object.</div></td>
     <td align='right'>211</td>
     <td align='right'>1</td>
     <td align='right'>14 KiB</td>
     <td align='right'>67 bytes</td>
   </tr>

   <tr>
     <td><a href="#OBJECT_INDEX">OBJECT_INDEX</a>
     <div class='desc'>Locate Git object by SHA-1 name.</div></td>
     <td align='right'>1,861,833</td>
     <td align='right'>1</td>
     <td align='right'>154 MiB</td>
     <td align='right'>87 bytes</td>
   </tr>

   <tr>
     <td><a href="#CHUNK">CHUNK</a>
     <div class='desc'>Complete Git object storage.</div></td>
     <td align='right'>402</td>
     <td align='right'>3</td>
     <td align='right'>417 MiB</td>
     <td align='right'>~ 1 MiB</td>
   </tr>

   <tr class='no_border'>
     <td align='right'><i>Total</i></td>
     <td align='right'>1,862,448</td>
     <td align='right'></td>
     <td align='right'>571 MiB</td>
     <td align='right'></td>
   </tr>
 </table>

 <a name="security"><h2>Data Security</h2></a>

 <p>If data encryption is necessary to protect file contents, the <a
 href="#CHUNK.chunk">CHUNK.chunk</a> column can be encrypted with a
 block cipher such as AES. This column contains the revision commit
 messages, file paths, and file contents. By encrypting one column, the
 majority of the repository data is secured. As each cell value is
 about 1 MiB and contains a trailing 4 bytes of random data, an ECB
 mode of operation may be sufficient. Because the cells are already
 very highly compressed using the Git data compression algorithms,
 there is no increase in disk usage due to encryption.</p>

 <p>Branch and tag names (<a href="#REF">REF</a> row keys) are not
 encrypted. If these need to be secured the portion after the ':' would
 need to be encrypted with a block cipher. However these strings are
 very short and very common (HEAD, refs/heads/master, refs/tags/v1.0),
 making encryption difficult. A variation on the schema might move all
 rows for a repository into a single protocol messsage, then encrypt
 the protobuf into a single cell. Unfortunately this strategy has a
 high update cost, and references change frequently.</p>

 <p>Object SHA-1 names (<a href="#OBJECT_INDEX">OBJECT_INDEX</a> row
 keys and <a href="#CHUNK.index">CHUNK.index</a> values) are not
 encrypted. This allows a reader to determine if a repository contains
 a specific revision, but does not allow them to inspect the contents
 of the revision. The CHUNK.index column could also be encrypted with a
 block cipher when CHUNK.chunk is encrypted (see above), however the
 OBJECT_INDEX table row keys cannot be encrypted if abbrevation
 expansion is to be supported for end-users of the repository. The row
 keys must be unencrypted as abbreviation resolution is performed by a
 prefix range scan over the keys.</p>

 <p>The remaining tables and columns contain only statistics (e.g.
 object counts or cell sizes), or internal surrogate keys
 (repository_id, chunk_key) and do not require encryption.</p>

 <hr />
 <a name="REPOSITORY_INDEX"><h2>Table REPOSITORY_INDEX</h2></a>

 <p>Maps a repository name, as presented in the URL by an end-user or
 client application, into its internal repository_id value. This
 mapping allows the repository name to be quickly modified (e.g.
 renamed) without needing to update the larger data rows of the
 repository.</p>

 <p>The exact semantics of the repository_name format is left as a
 deployment decision, but DNS hostname, '/', repository name would be
 one common usage.</p>

 <h3>Row Key</h3>
 <div class='rowkey'>
   <div>
   <span class='header'>Row Key:</span>
   <span class='var'>repository_name</span>
   </div>

   <p>Human readable name of the repository, typically derived from the
   HTTP <code>Host</code> header and path in the URL.</p>

   <p>Examples:</p>
   <ul>
   <li><span class='example'>com.example.git/pub/git/foo.git</span></li>
   <li><span class='example'>com.example.git/~user/mystuff.git</span></li>
   </ul>
 </div>

 <h3>Columns</h3>
 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="REPOSITORY_INDEX.id"><span class='lit'>id:</span></a>
   </div>

   <p>The repository_id, as an 8-digit hex ASCII string.</p>
 </div>

 <h3>Size Estimate</h3>

 <p>Less than 100,000 rows. More likely estimate is 1,000 rows.
 Total disk usage under 512 KiB, assuming 1,000 names and 256
 characters per name.</p>

 <h3>Updates</h3>

 <p>Only on repository creation or rename, which is infrequent (&lt;10
 rows/month). Updates are performed in a row-level transaction, to
 ensure a name is either assigned uniquely, or fails.</p>

 <h3>Reads</h3>

 <p>Reads are tried first against memcache, then against the DHT if the
 entry did not exist in memcache. Successful reads against the DHT are
 put back into memcache in the background.</p>

 <a name="REPOSITORY"><h2>Table REPOSITORY</h2></a>

 <p>Tracks top-level information about each repository.</p>

 <h3>Row Key</h3>
 <div class='rowkey'>
   <div>
   <span class='header'>Row Key:</span>
   <span class='var'>repository_id</span>
   </div>

   <p>The repository_id, as an 8-digit hex ASCII string.</p>
 </div>

 <p>Typically this is assigned sequentially, then has the bits reversed
 to evenly spread repositories throughout the DHT.  For example the
 first repository is <code>80000000</code>, and the second is
 <code>40000000</code>.</p>

 <h3>Columns</h3>
 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="REPOSITORY.chunk-info"><span class='lit'>chunk-info:</span></a>
   <span class='var'>chunk_key[short]</span>
   </div>

   <p>Cell value is the protocol message <a
   href="#message_ChunkInfo">ChunkInfo</a> describing the chunk's
   contents. Most of the message's fields are only useful for quota
   accounting and reporting.</p>
 </div>

 <p>This column exists to track all of the chunks that make up a
 repository's object set. Garbage collection and quota accounting tasks
 can primarily drive off this column, rather than scanning the much
 larger <a href="#CHUNK">CHUNK</a> table with a regular expression on
 the chunk row key.</p>

 <p>As each chunk averages 1 MiB in size, the linux-2.6 repository
 (at 373 MiB) has about 400 chunks and thus about 400 chunk-info
 cells. The chromium repository (at 1 GiB) has about 1000 chunk-info
 cells. It would not be uncommon to have 2000 chunk-info cells.</p>

 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="REPOSITORY.cached-pack"><span class='lit'>cached-pack:</span></a>
   <span class='var'>NNNNx38</span>
   <span class='lit'>.</span>
   <span class='var'>VVVVx38</span>
   </div>

   <p>Variables:</p>
   <ul>
   <li><span class='var'>NNNNx38</span> = 40 hex digit name of the cached pack</li>
   <li><span class='var'>VVVVx38</span> = 40 hex digit version of the cached pack</li>
   </ul>

   <p>Examples:</p>
   <ul>
   <li><span class='example'>4e32fb97103981e7dd53dcc786640fa4fdb444b8.8975104a03d22e54f7060502e687599d1a2c2516</span></li>
   </ul>

   <p>Cell value is the protocol message <a
   href="#message_CachedPackInfo">CachedPackInfo</a> describing the
   chunks that make up a cached pack.</p>
 </div>

 <p>The <code>cached-pack</code> column family describes large lists of
 chunks that when combined together in a specific order create a valid
 Git pack file directly streamable to a client. This avoids needing to
 enumerate and pack the entire repository on each request.</p>

 <p>The cached-pack name (NNNNx38 above) is the SHA-1 of the objects
 contained within the pack, in binary, sorted. This is the standard
 naming convention for pack files on the local filesystem. The version
 (VVVVx38 above) is the SHA-1 of the chunk keys, sorted. The version
 makes the cached-pack cell unique, if any single bit in the compressed
 data is modified a different version will be generated, and a
 different cell will be used to describe the alternate version of the
 same data. The version is necessary to prevent repacks of the same
 object set (but with different compression settings or results) from
 stepping on active readers.</p>

 <h2>Size Estimate</h2>

 <p>1 row per repository (~1,000 repositories), however the majority of
 the storage cost is in the <code>chunk-info</code> column family,
 which can have more than 2000 cells per repository.</p>

 <p>Each <code>chunk-info</code> cell is on average 147 bytes. For a
 large repository like chromium.git (over 1000 chunks) this is only 147
 KiB for the entire row.</p>

 <p>Each <code>cached-pack</code> cell is on average 5350 bytes. Most
 repositories have 1 of these cells, 2 while the repository is being
 repacked on the server side to update the cached-pack data.</p>

 <h2>Updates</h2>

 <p>Information about each ~1 MiB chunk of pack data received over the
 network is stored as a unique column in the <code>chunk-info</code>
 column family.</p>

 <p>Most pushes are at least 2 chunks (commit, tree), with 50 pushes
 per repository per day being possible (50,000 new cells/day).</p>

 <p><b>TODO:</b> Average push rates?</p>

 <h2>Reads</h2>

 <p><i>Serving clients:</i> Read all cells in the
 <code>cached-pack</code> column family, typically only 1-5 cells. The
 cells are cached in memcache and read from there first.</p>

 <p><i>Garbage collection:</i> Read all cells in the
 <code>chunk-info</code> column family to determine which chunks are
 owned by this repository, without scanning the <a href="#CHUNK">CHUNK</a> table.
 Delete <code>chunk-info</code> after the corresponding <a href="#CHUNK">CHUNK</a>
 row has been deleted. Unchanged chunks have their info left alone.</p>

 <a name="REF"><h2>Table REF</h2></a>

 <p>Associates a human readable branch (e.g.
 <code>refs/heads/master</code>) or tag (e.g.
 <code>refs/tags/v1.0</code>) name to the Git
 object that represents that current state of
 the repository.</p>

 <h3>Row Key</h3>
 <div class='rowkey'>
   <div>
   <span class='header'>Row Key:</span>
   <span class='var'>repository_id</span>
   <span class='lit'>:</span>
   <span class='var'>ref_name</span>
   </div>

   <p>Variables:</p>
   <ul>
   <li><span class='var'>repository_id</span> = Repository owning the reference (see above)</li>
   <li><span class='var'>ref_name</span> = Name of the reference, UTF-8 string</li>
   </ul>

   <p>Examples:</p>
   <ul>
   <li><span class='example'>80000000:HEAD</span></li>
   <li><span class='example'>80000000:refs/heads/master</span></li>
   <br />
   <li><span class='example'>40000000:HEAD</span></li>
   <li><span class='example'>40000000:refs/heads/master</span></li>
   </ul>
 </div>

 <p>The separator <code>:</code> used in the row key was chosen because
 this character is not permitted in a Git reference name.</p>

 <h3>Columns</h3>
 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="REF.target"><span class='lit'>target:</span></a>
   </div>

   <p>Cell value is the protocol message
   <a href="#message_RefData">RefData</a> describing the
   current SHA-1 the reference points to, and the chunk
   it was last observed in. The chunk hint allows skipping
   a read of <a href="#OBJECT_INDEX">OBJECT_INDEX</a>.</p>

   <p>Several versions (5) are stored for emergency rollbacks.
   Additional versions beyond 5 are cleaned up during table garbage
   collection as managed by the DHT's cell GC.</p>
 </div>

 <h3>Size Estimate</h3>

 <p><i>Normal Git usage:</i> ~10 branches per repository, ~200 tags.
 For 1,000 repositories, about 200,000 rows total. Average row size is
 about 240 bytes/row before compression (67 after), or 48M total.</p>

 <p><i>Gerrit Code Review usage:</i> More than 300 new rows per day.
 Each snapshot of each change under review is one reference.</p>

 <h3>Updates</h3>

 <p>Writes are performed by doing an atomic compare-and-swap (through a
 transaction), changing the RefData protocol buffer.</p>

 <h3>Reads</h3>

 <p>Reads perform prefix scan for all rows starting with
 <code>repository_id:</code>. Plans exist to cache these reads within a
 custom service, avoiding most DHT queries.</p>

 <a name="OBJECT_INDEX"><h2>Table OBJECT_INDEX</h2></a>

 <p>The Git network protocol has clients sending object SHA-1s to the
 server, with no additional context or information. End-users may also
 type a SHA-1 into a web search box. This table provides a mapping of
 the object SHA-1 to which chunk(s) store the object's data. The table
 is sometimes also called the 'global index', since it names where
 every single object is stored.</p>

 <h3>Row Key</h3>
 <div class='rowkey'>
   <div>
   <span class='header'>Row Key:</span>
   <span class='var'>NN</span>
   <span class='lit'>.</span>
   <span class='var'>repository_id</span>
   <span class='lit'>.</span>
   <span class='var'>NNx40</span>
   </div>

   <p>Variables:</p>
   <ul>
   <li><span class='var'>NN</span> = First 2 hex digits of object SHA-1</li>
   <li><span class='var'>repository_id</span> = Repository owning the object (see above)</li>
   <li><span class='var'>NNx40</span> = Complete object SHA-1 name, in hex</li>
   </ul>

   <p>Examples:</p>
   <ul>
   <li><span class='example'>2b.80000000.2b5c9037c81c38b3b9abc29a3a87a4abcd665ed4</span></li>
   <li><span class='example'>8f.40000000.8f270a441569b127cc4af8a6ef601d94d9490efb</span></li>
   </ul>
 </div>

 <p>The first 2 hex digits (<code>NN</code>) distribute object keys
 within the same repository around the DHT keyspace, preventing a busy
 repository from creating too much of a hot-spot within the DHT.  To
 simplify key generation, these 2 digits are repeated after the
 repository_id, as part of the 40 hex digit object name.</p>

 <p>Keys must be clustered by repository_id to support extending
 abbreviations. End-users may supply an abbreviated SHA-1 of 4 or more
 digits (up to 39) and ask the server to complete them to a full 40
 digit SHA-1 if the server has the relevant object within the
 repository's object set.</p>

 <p>A schema variant that did not include the repository_id as part of
 the row key was considered, but discarded because completing a short
 4-6 digit abbreviated SHA-1 would be impractical once there were
 billions of objects stored in the DHT. Git end-users expect to be able
 to use 4 or 6 digit abbreviations on very small repositories, as the
 number of objects is low and thus the number of bits required to
 uniquely name an object within that object set is small.</p>

 <h3>Columns</h3>
 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="OBJECT_INDEX.info"><span class='lit'>info:</span></a>
   <span class='var'>chunk_key[short]</span>
   </div>

   <p>Cell value is the protocol message
   <a href="#message_ObjectInfo">ObjectInfo</a> describing how the object
   named by the row key is stored in the chunk named by the column name.</p>

   <p>Cell timestamp matters. The <b>oldest cell</b> within the
   entire column family is favored during reads. As chunk_key is
   unique, versions within a single column aren't relevant.</p>
 </div>

 <h3>Size Estimate</h3>

 <p>Average row size per object/chunk pair is 144 bytes uncompressed
 (87 compressed), based on the linux-2.6 repository. The linux-2.6
 repository has 1.8 million objects, and is growing at a rate of about
 300,000 objects/year. Total usage for linux-2.6 is above 154M.</p>

 <p>Most rows contain only 1 cell, as the object appears in only 1
 chunk within that repository.</p>

 <p><i>Worst case:</i> 1.8 million rows/repository * 1,000 repositories
 is around 1.8 billion rows and 182G.</p>

 <h3>Updates</h3>

 <p>One write per object received over the network; typically performed
 as part of an asynchronous batch. Each batch is sized around 512 KiB
 (about 3000 rows). Because of SHA-1's uniform distribution, row keys
 are first sorted and then batched into buckets of about 3000 rows. To
 prevent too much activity from going to one table segment at a time
 the complete object list is segmented into up to 32 groups which are
 written in round-robin order.</p>

 <p>A full push of the linux-2.6 repository writes 1.8 million
 rows as there are 1.8 million objects in the pack stream.</p>

 <p>During normal insert or receive operations, each received object is
 a blind write to add one new <code>info:chunk_key[short]</code> cell
 to the row.  During repack, all cells in the <code>info</code> column
 family are replaced with a single cell.</p>

 <h3>Reads</h3>

 <p>During common ancestor negotiation reads occur in batches of 64-128
 full row keys, uniformly distributed throughout the key space. Most of
 these reads are misses, the OBJECT_INDEX table does not contain the
 key offered by the client. A successful negotation for most developers
 requires at least two rounds of 64 objects back-to-back over HTTP. Due
 to the high miss rate on this table, an in-memory bloom filter may be
 important for performance.</p>

 <p>To support the high read-rate (and high miss-rate) during common
 ancestor negotation, an alternative to an in-memory bloom filter
 within the DHT is to downoad the entire set of keys into an alternate
 service job for recently accessed repositories. This service can only
 be used if <i>all</i> of the keys for the same repository_id are
 hosted within the service. Given this is under 36 GiB for the worst
 case 1.8 billion rows mentioned above, this may be feasible. Loading
 the table can be performed by fetching <a
 href="#REPOSITORY.chunk-info">REPOSITORY.chunk-info</a> and then
 performing parallel gets for the <a
 href="#CHUNK.index">CHUNK.index</a> column, and scanning the local
 indexes to construct the list of known objects.</p>

 <p>During repacking with no delta reuse, worst case scenario requires
 reading all records with the same repository_id (for linux-2.6 this
 is 1.8 million rows). Reads are made in a configurable batch size,
 right now this is set at 2048 keys/batch, with 4 concurrent batches in
 flight at a time.</p>

 <p>Reads are tried first against memcache, then against the DHT if the
 entry did not exist in memcache. Successful reads against the DHT are
 put back into memcache in the background.</p>

 <a name="CHUNK"><h2>Table CHUNK</h2></a>

 <p>Stores the object data for a repository, containing commit history,
 directory structure, and file revisions. Each chunk is typically 1 MiB
 in size, excluding the index and meta columns.</p>

 <h3>Row Key</h3>
 <div class='rowkey'>
   <div>
   <span class='header'>Row Key:</span>
   <span class='var'>HH</span>
   <span class='lit'>.</span>
   <span class='var'>repository_id</span>
   <span class='lit'>.</span>
   <span class='var'>HHx40</span>
   </div>

   <p>Variables:</p>
   <ul>
   <li><span class='var'>HH</span> = First 2 hex digits of chunk SHA-1</li>
   <li><span class='var'>repository_id</span> = Repository owning the chunk (see above)</li>
   <li><span class='var'>HHx40</span> = Complete chunk SHA-1, in hex</li>
   </ul>

   <p>Examples:</p>
   <ul>
   <li><span class='example'>09.80000000.09e0eb57543be633b004b672cbebdf335aa4d53f</span> <i>(full key)</i></li>
   </ul>
 </div>

 <p>Chunk keys are computed by first computing the SHA-1 of the
 <code>chunk:</code> column, which is the compressed object contents
 stored within the chunk. As the chunk data includes a 32 bit salt in
 the trailing 4 bytes, this value is random even for the exact same
 object input.</p>

 <p>The leading 2 hex digit <code>HH</code> component distributes
 chunks for the same repository (and over the same time period) evenly
 around the DHT keyspace, preventing any portion from becoming too
 hot.</p>

 <h3>Columns</h3>
 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="CHUNK.chunk"><span class='lit'>chunk:</span></a>
   </div>

   <p>Multiple objects in Git pack-file format, about 1 MiB in size.
   The data is already very highly compressed by Git and is not further
   compressable by the DHT.</p>
 </div>

 <p>This column is essentially the standard Git pack-file format,
 without the standard header or trailer. Objects can be stored in
 either whole format (object content is simply deflated inline)
 or in delta format (reference to a delta base is followed by
 deflated sequence of copy and/or insert instructions to recreate
 the object content). The OBJ_OFS_DELTA format is preferred
 for deltas, since it tends to use a shorter encoding than the
 OBJ_REF_DELTA format. Offsets beyond the start of the chunk are
 actually offsets to other chunks, and must be resolved using the
 <code>meta.base_chunk.relative_start</code> field.</p>

 <p>Because the row key is derived from the SHA-1 of this column, the
 trailing 4 bytes is randomly generated at insertion time, to make it
 impractical for remote clients to predict the name of the chunk row.
 This allows the stream parser to bindly insert rows without first
 checking for row existance, or worry about replacing an existing
 row and causing data corruption.</p>

 <p>This column value is essentially opaque to the DHT.</p>

 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="CHUNK.index"><span class='lit'>index:</span></a>
   </div>

   <p>Binary searchable table listing object SHA-1 and starting offset
   of that object within the <code>chunk:</code> data stream. The data
   in this index is essentially random (due to the SHA-1s stored in
   binary) and thus is not compressable.</p>
 </div>

 <p>Sorted list of SHA-1s of each object that is stored in this chunk,
 along with the offset. This column allows efficient random access to
 any object within the chunk, without needing to perform a remote read
 against <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table. The column is
 very useful at read time, where pointers within Git objects will
 frequently reference other objects stored in the same chunk.</p>

 <p>This column is sometimes called the local index, since it is local
 only to the chunk and thus differs from the global index stored in the
 <a href="#OBJECT_INDEX">OBJECT_INDEX</a> table.</p>

 <p>The column size is 24 bytes per object stored in the chunk. Commit
 chunks store on average 2200 commits/chunk, so a commit chunk index is
 about 52,800 bytes.</p>

 <p>This column value is essentially opaque to the DHT.</p>

 <div class='colfamily'>
   <div>
   <span class='header'>Column:</span>
   <a name="CHUNK.meta"><span class='lit'>meta:</span></a>
   </div>

   <p>Cell value is the protocol message
   <a href="#message_ChunkMeta">ChunkMeta</a> describing prefetch
   hints, object fragmentation, and delta base locations. Unlike
   <code>chunk:</code> and <code>index:</code>, this column is
   somewhat compressable.</p>
 </div>

 <p>The meta column provides information critical for reading the
 chunk's data. (Unlike <a href="#message_ChunkInfo">ChunkInfo</a> in
 the <a href="#REPOSITORY">REPOSITORY</a> table, which is used only for
 accounting.)</p>

 <p>The most important element is the BaseChunk nested message,
 describing a chunk that contains a base object required to inflate
 an object that is stored in this chunk as a delta.</p>

 <h3>Chunk Contents</h3>

 <p>Chunks try to store only a single object type, however mixed object
 type chunks are supported. The rule to store only one object type per
 chunk improves data locality, reducing the number of chunks that need
 to be accessed from the DHT in order to perform a particular Git
 operation. Clustering commits together into a 'commit chunk' improves
 data locality during log/history walking operations, while clustering
 trees together into a 'tree chunk' improves data locality during the
 early stages of packing or difference generation.</p>

 <p>Chunks reuse the standard Git pack data format to support direct
 streaming of a chunk's <code>chunk:</code> column to clients, without
 needing to perform any data manipulation on the server. This enables
 high speed data transfer from the DHT to the client.</p>

 <h3>Large Object Fragmentation</h3>

 <p>If a chunk contains more than one object, all objects within the
 chunk must store their complete compressed form within the chunk. This
 limits an object to less than 1 MiB of compressed data.</p>

 <p>Larger objects whose compressed size is bigger than 1 MiB are
 fragmented into multiple chunks. The first chunk contains the object's
 pack header, and the first 1 MiB of compressed data. Subsequent data
 is stored in additional chunks. The additional chunk keys are stored
 in the <code>meta.fragment</code> field. Each chunk that is part of
 the same large object redundantly stores the exact same meta
 value.</p>

 <h3>Size Estimate</h3>

 <p>Approximately the same size if the repository was stored on the
 local filesystem. For the linux-2.6 repository (373M / 1.8 million
 objects), about 417M (373.75M in <code>chunk:</code>, 42.64M in
 <code>index:</code>, 656K in <code>meta:</code>).</p>

 <p>Row count is close to size / 1M (373M / 1M = 373 rows), but may be
 slightly higher (e.g. 402) due to fractional chunks on the end of
 large fragmented objects, or where the single object type rule caused a
 chunk to close before it was full.</p>

 <p>For the complete Android repository set, disk usage is ~13G.</p>

 <h3>Updates</h3>

 <p>This table is (mostly) append-only. Write operations blast in ~1
 MiB chunks, as the key format assures writers the new row does not
 already exist. Chunks are randomly scattered by the hashing function,
 and are not buffered very deep by writers.</p>

 <p><i>Interactive writes:</i> Small operations impacting only 1-5
 chunks will write all columns in a single operation. Most chunks of
 this varity will be very small, 1-10 objects per chunk and about 1-10
 KiB worth of compressed data inside of the <code>chunk:</code> column.
 This class of write represents a single change made by one developer
 that must be shared back out immediately.</p>

 <p><i>Large pushes:</i> Large operations impacting tens to hundreds of
 chunks will first write the <code>chunk:</code> column, then come back
 later and populate the <code>index:</code> and <code>meta:</code>
 columns once all chunks have been written. The delayed writing of
 index and meta during large operations is required because the
 information for these columns is not available until the entire data
 stream from the Git client has been received and scanned. As the Git
 server may not have sufficient memory to store all chunk data (373M or
 1G!), its written out first to free up memory.</p>

 <p><i>Garbage collection:</i> Chunks that are not optimally sized
 (less than the target ~1 MiB), optimally localized (too many graph
 pointers outside of the chunk), or compressed (Git found a smaller way
 to store the same content) will be replaced by first writing new
 chunks, and then deleting the old chunks.</p>

 <p>Worst case, this could churn as many as 402 rows and 373M worth of
 data for the linux-2.6 repository. Special consideration will be made
 to try and avoid replacing chunks whose <code>WWWW</code> key
 component is 'sufficiently old' and whose content is already
 sufficiently sized and compressed. This will help to limit churn to
 only more recently dated chunks, which are smaller in size.</p>

 <h3>Reads</h3>

 <p>All columns are read together as a unit. Memcache is checked first,
 with reads falling back to the DHT if the cache does not have the
 chunk.</p>

 <p>Reasonably accurate prefetching is supported through background
 threads and prefetching metadata embedded in the <a
 href="#message_CachedPackInfo">CachedPackInfo</a> and <a
 href="#message_ChunkMeta">ChunkMeta</a> protocol messages used by
 readers.</p>

 <hr />
 <h2>Protocol Messages</h2>

 <pre>
 package git_store;
 option java_package = "org.eclipse.jgit.storage.dht.proto";


     // Entry in RefTable describing the target of the reference.
     // Either symref *OR* target must be populated, but never both.
     //
 <a name="message_RefData">message RefData</a> {
     // An ObjectId with an optional hint about where it can be found.
     //
   message Id {
     required string object_name = 1;
     optional string chunk_key = 2;
   }

     // Name of another reference this reference inherits its target
     // from.  The target is inherited on-the-fly at runtime by reading
     // the other reference.  Typically only "HEAD" uses symref.
     //
   optional string symref = 1;

     // ObjectId this reference currently points at.
     //
   optional Id target = 2;

     // True if the correct value for peeled is stored.
     //
   optional bool is_peeled = 3;

     // If is_peeled is true, this field is accurate.  This field
     // exists only if target points to annotated tag object, then
     // this field stores the "object" field for that tag.
     //
   optional Id peeled = 4;
 }


     // Entry in ObjectIndexTable, describes how an object appears in a chunk.
     //
 <a name="message_ObjectInfo">message ObjectInfo</a> {
     // Type of Git object.
     //
   enum ObjectType {
     COMMIT = 1;
     TREE = 2;
     BLOB = 3;
     TAG = 4;
   }
   optional ObjectType object_type = 1;

     // Position of the object's header within its chunk.
     //
   required int32 offset = 2;

     // Total number of compressed data bytes, not including the pack
     // header. For fragmented objects this is the sum of all chunks.
     //
   required int64 packed_size = 3;

     // Total number of bytes of the uncompressed object. For a
     // delta this is the size after applying the delta onto its base.
     //
   required int64 inflated_size = 4;

     // ObjectId of the delta base, if this object is stored as a delta.
     // The base is stored in raw binary.
     //
   optional bytes delta_base = 5;
 }


     // Describes at a high-level the information about a chunk.
     // A repository can use this summary to determine how much
     // data is stored, or when garbage collection should occur.
     //
 <a name="message_ChunkInfo">message ChunkInfo</a> {
     // Source of the chunk (what code path created it).
     //
   enum Source {
     RECEIVE = 1;    // Came in over the network from external source.
     INSERT = 2;     // Created in this repository (e.g. a merge).
     REPACK = 3;     // Generated during a repack of this repository.
   }
   optional Source source = 1;

     // Type of Git object stored in this chunk.
     //
   enum ObjectType {
     MIXED = 0;
     COMMIT = 1;
     TREE = 2;
     BLOB = 3;
     TAG = 4;
   }
   optional ObjectType object_type = 2;

     // True if this chunk is a member of a fragmented object.
     //
   optional bool is_fragment = 3;

     // If present, key of the CachedPackInfo object
     // that this chunk is a member of.
     //
   optional string cached_pack_key = 4;

     // Summary description of the objects stored here.
     //
   message ObjectCounts {
       // Number of objects stored in this chunk.
       //
     optional int32 total = 1;

       // Number of objects stored in whole (non-delta) form.
       //
     optional int32 whole = 2;

       // Number of objects stored in OFS_DELTA format.
       // The delta base appears in the same chunk, or
       // may appear in an earlier chunk through the
       // ChunkMeta.base_chunk link.
       //
     optional int32 ofs_delta = 3;

       // Number of objects stored in REF_DELTA format.
       // The delta base is at an unknown location.
       //
     optional int32 ref_delta = 4;
   }
   optional ObjectCounts object_counts = 5;

     // Size in bytes of the chunk's compressed data column.
     //
   optional int32 chunk_size = 6;

     // Size in bytes of the chunk's index.
     //
   optional int32 index_size = 7;

     // Size in bytes of the meta information.
     //
   optional int32 meta_size  = 8;
 }


     // Describes meta information about a chunk, stored inline with it.
     //
 <a name="message_ChunkMeta">message ChunkMeta</a> {
     // Enumerates the other chunks this chunk depends upon by OFS_DELTA.
     // Entries are sorted by relative_start ascending, enabling search.  Thus
     // the earliest chunk is at the end of the list.
     //
   message BaseChunk {
       // Bytes between start of the base chunk and start of this chunk.
       // Although the value is positive, its a negative offset.
       //
     required int64 relative_start = 1;
     required string chunk_key = 2;
   }
   repeated BaseChunk base_chunk = 1;

     // If this chunk is part of a fragment, key of every chunk that
     // makes up the fragment, including this chunk.
     //
   repeated string fragment = 2;

     // Chunks that should be prefetched if reading the current chunk.
     //
   message PrefetchHint {
     repeated string edge = 1;
     repeated string sequential = 2;
   }
   optional PrefetchHint commit_prefetch = 51;
   optional PrefetchHint tree_prefetch = 52;
 }


    // Describes a CachedPack, for efficient bulk clones.
     //
 <a name="message_CachedPackInfo">message CachedPackInfo</a> {
     // Unique name of the cached pack.  This is the SHA-1 hash of
     // all of the objects that make up the cached pack, sorted and
     // in binary form.  (Same rules as Git on the filesystem.)
     //
   required string name = 1;

     // SHA-1 of all chunk keys, which are themselves SHA-1s of the
     // raw chunk data. If any bit differs in compression (due to
     // repacking) the version will differ.
     //
   required string version = 2;

     // Total number of objects in the cached pack. This must be known
     // in order to set the final resulting pack header correctly before it
     // is sent to clients.
     //
   required int64 objects_total = 3;

     // Number of objects stored as deltas, rather than deflated whole.
     //
   optional int64 objects_delta = 4;

     // Total size of the chunks, in bytes, not including the chunk footer.
     //
   optional int64 bytes_total = 5;

     // Objects this pack starts from.
     //
   message TipObjectList {
     repeated string object_name = 1;
   }
   required TipObjectList tip_list = 6;

     // Chunks, in order of occurrence in the stream.
     //
   message ChunkList {
     repeated string chunk_key = 1;
   }
   required ChunkList chunk_list = 7;
 }
 </pre>

 </body>
 </html>