 ## What is notedb?

 Notedb is the successor to ReviewDb: a replacement database backend for Gerrit.
 The goal is to read code review metadata from the same set of repositories that
 store the data. This allows for improved atomicity, consistency of replication,
 and the creation of new features like federated review and offline review.

 This document describes the state of migration (not necessarily completely up to
 date), the tasks that remain, and notes on some of the challenges we've
 encountered along the way. This document is **not** a full design document for
 notedb; if you're that curious, bug dborowitz and he will help you out.

 Finally, this document is for core developers. If you are a casual user of
 Gerrit looking for documentation, you've come to the wrong place.

 ## Root Tables

 While ReviewDb has a lot of tables, there are relatively few "root" tables, that
 is, tables whose primary key's `getParentKey()` method returns null:

 * **Change**: subtables ChangeMessage, PatchSet, PatchSetApproval,
   PatchSetAncestor, PatchLineComment, TrackingId
 * **Account**: subtables AccountExternalId, AccountProjectWatch,
   AccountSshKey, AccountDiffPreference, StarredChange
 * **AccountGroup**: subtables AccountGroupMember

 TODO(dborowitz): Document other minor tables, audits, etc.

 For each root entity in each of these tables, there is one DAG describing all
 modifications that have been applied to that entity over time. Entity DAGs are
 stored in a repository corresponding to their function:

 * **Change**: stored in `refs/changes/YZ/XYZ/meta` in the destination repository
 * **Account**: stored in `refs/accounts/YZ/XYZ/meta` (TBD) in `All-Users`
 * **AccountGroup**: stored in `TBD` in `All-Projects`
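
The `YZ/XYZ` sharding in the ref names above can be sketched roughly as follows. This assumes the shard is the last two digits of the numeric ID, zero-padded; the helper name `change_meta_ref` is hypothetical and is not taken from the Gerrit code base.

```python
def change_meta_ref(change_id: int) -> str:
    """Build the sharded meta ref name for a change (hypothetical helper)."""
    if change_id <= 0:
        raise ValueError("change IDs are positive integers")
    # Assumption: shard on the last two digits so refs spread over 100 directories.
    shard = change_id % 100
    return f"refs/changes/{shard:02d}/{change_id}/meta"
```

Under that assumption, change 1234 would live at `refs/changes/34/1234/meta`, and a single-digit change such as 7 would be padded to `refs/changes/07/7/meta`.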

 Most of this document focuses on Change entities, partly because they are the
 most complex, but also because that's where most effort to date has been focused.

 ## Changes: What's Done

 Current progress, along with some possibly-interesting implementation notes.

 *   ChangeMessages: Stored in commit message body. Currently the subject of the
     commit message contains a machine-generated-but-readable summary like
     "Updated patch set 3"; we might decide to eliminate this and just use the
     ChangeMessage.
 *   PatchSetApprovals: Stored as a footer "Label: Label-Name=Foo". Instead of
     storing an implicit 0 review for reviewers, include them explicitly in a
     separate Reviewer footer. Freeze labels at submit time along with the full
     submit rule evaluator results using "Submitted-with".
 *   PatchLineComments: The only thing thus far actually stored in a note, on the
     commit to which it applies. Drafts are stored in a per-user ref in
     All-Users.
 *   TrackingId: Killed this long ago; we use the secondary index instead. (Just
     reminded myself I need to rip out the gwtorm entity.)
 *   Change: Started storing obvious fields: owner, project, branch, topic.
 *   PatchSet: Storing IDs and created-on timestamps right now; not reading from
     it yet.
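
To make the commit layout described above concrete, here is a minimal sketch of splitting one metadata commit message into its subject, the ChangeMessage body, and the trailing footers. The rule "the last paragraph is a footer block if every line looks like `Key: value`" is an assumption made for illustration, as are the names `ParsedMetaCommit` and `parse_meta_commit` and the sample label and reviewer values; the real parser may decide differently.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedMetaCommit:
    subject: str          # machine-generated summary line
    change_message: str   # commit message body, i.e. the ChangeMessage
    footers: dict[str, list[str]] = field(default_factory=dict)


def parse_meta_commit(message: str) -> ParsedMetaCommit:
    """Split a metadata commit message into subject, body, and footers."""
    paragraphs = [p.strip() for p in message.strip().split("\n\n") if p.strip()]
    subject = paragraphs[0] if paragraphs else ""
    rest = paragraphs[1:]
    footers: dict[str, list[str]] = {}
    # Assumption: the last paragraph is a footer block only if every line
    # in it has the shape "Key: value".
    if rest and all(": " in line for line in rest[-1].splitlines()):
        for line in rest.pop().splitlines():
            key, value = line.split(": ", 1)
            # A key such as "Label" may repeat, so collect values in a list.
            footers.setdefault(key, []).append(value)
    return ParsedMetaCommit(subject, "\n\n".join(rest), footers)


# Sample message; the label names and the reviewer identity are made up.
example = """Updated patch set 3

Looks good to me.

Label: Code-Review=+2
Label: Verified=+1
Reviewer: Example User <user@example.com>
"""
parsed = parse_meta_commit(example)
```

Here `parsed.footers["Label"]` holds both label votes, which is why repeated footer keys are kept as lists rather than overwritten.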

 ## Changes: What Needs to Be Done

 *   PatchSetAncestors: Replace with a persistent cache, which should probably be
     rebuilt in RebuildNotedb.
 *   PatchSet: Draft field. (Someday I think we should replace Draft with WIP,
     but I digress.)
 *   Change: Kill more fields. Actually implement reading from changes; see the
     Challenges section.
 *   Some sort of batch parser. If we get 100 changes back from a search,
     sequentially reading the DAG for each of those might take a while.
 *   Benchmark and optimize the heck out of the parsers. Let's say a target is
     1ms per change DAG. Relatedly, we may also need to disable (eager) parsing of
     certain fields, if we can prove with benchmarks that they are problematic.
     (We already do this for PatchLineComments, to avoid having to read anything
     but commits in the common case.)
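
The eager-versus-lazy split mentioned in the last item can be illustrated with a small sketch. The class `ChangeNotes` and the `load_comment_notes` callback are hypothetical stand-ins, not the actual Gerrit types; the point is only that commit messages are parsed up front while the note contents are read on first access.

```python
from functools import cached_property
from typing import Callable


class ChangeNotes:
    """Parsed view of one change DAG (hypothetical class for illustration)."""

    def __init__(
        self,
        commit_messages: list[str],
        load_comment_notes: Callable[[], list[str]],
    ) -> None:
        # Eager part: only commit messages are touched at construction time.
        self.messages = list(commit_messages)
        self._load_comment_notes = load_comment_notes

    @cached_property
    def comments(self) -> list[str]:
        # Lazy part: the note reader runs on first access only; the result
        # is then cached on the instance.
        return self._load_comment_notes()
```

A caller that only lists changes never pays for reading notes, while a caller that renders comments triggers the load exactly once per `ChangeNotes` instance.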

 ## JGit Changes

 *   Teach JGit to pack notedb refs in a separate pack file from everything else.
     We don't want to hurt locality within the main pack by interleaving metadata
     commits.
 *   Teach JGit to better handle many small, separate commit graphs in the
     packer. Ordered chronologically, notedb metadata will be spread across a
     large number of separate DAGs. We will get better kernel buffer cache
     locality by clustering all commits in each disconnected DAG together. (But
     this may also hurt batch parsing; see above.)
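
The difference between chronological ordering and per-DAG clustering can be shown with a toy sketch. This is not how the JGit packer is implemented; the tuple shape `(timestamp, dag_ref, commit_id)` and the function name `cluster_by_dag` are assumptions made purely for illustration.

```python
from collections import defaultdict


def cluster_by_dag(commits: list[tuple[int, str, str]]) -> list[str]:
    """Order commit IDs so that each DAG's commits are contiguous.

    Each input tuple is (timestamp, dag_ref, commit_id); all three are
    hypothetical stand-ins for what a packer would see.
    """
    by_dag: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for timestamp, dag_ref, commit_id in commits:
        by_dag[dag_ref].append((timestamp, commit_id))
    ordered: list[str] = []
    # Any stable DAG order works for the sketch; sort by ref name here.
    for dag_ref in sorted(by_dag):
        # Within one DAG, keep commits in chronological order.
        for _, commit_id in sorted(by_dag[dag_ref]):
            ordered.append(commit_id)
    return ordered
```

With two changes whose metadata commits interleave in time, a chronological order alternates between the DAGs, whereas the clustered order emits all commits of the first change before any commit of the second.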

 ## Challenges

 TODO