OBJECTIVE
=========
Provide full-text code search for git-based corpora.
Goals:
* sub-50ms results on large codebases, such as Android (~2G text) or
Chrome
* works well on a single standard Linux machine, with stable storage on SSD
* search multiple repositories and multiple branches.
* provide rich query language, with boolean operators
* integrate with Gerrit/Gitiles code review/browsing system
SEARCHING AND INDEXING
======================
Positional trigrams
-------------------
We build an index of ngrams (n=3), where we store the offset of each
ngram's occurrence within a file. For example, if the corpus is "banana",
then we generate the index

    "ban": 0
    "ana": 1,3
    "nan": 2
If we are searching for a string (eg. "The quick brown fox"), then we
look for two trigrams (eg. "The" and "fox"), and check that they are
found at the right distance apart.
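
A minimal sketch of this scheme in Go (using byte offsets and ASCII-only
content; all names are illustrative, not zoekt's actual data structures):

    package main

    import "fmt"

    // index maps each trigram to the sorted offsets at which it occurs.
    func index(content string) map[string][]int {
        idx := map[string][]int{}
        for i := 0; i+3 <= len(content); i++ {
            idx[content[i:i+3]] = append(idx[content[i:i+3]], i)
        }
        return idx
    }

    // find returns candidate offsets for pattern: positions where the first
    // and last trigram of the pattern occur the right distance apart. The
    // candidates must still be verified against the actual content.
    func find(idx map[string][]int, pattern string) []int {
        first, last := pattern[:3], pattern[len(pattern)-3:]
        dist := len(pattern) - 3
        lastOffsets := map[int]bool{}
        for _, off := range idx[last] {
            lastOffsets[off] = true
        }
        var hits []int
        for _, off := range idx[first] {
            if lastOffsets[off+dist] {
                hits = append(hits, off)
            }
        }
        return hits
    }

    func main() {
        idx := index("banana")
        fmt.Println(idx["ana"])         // [1 3]
        fmt.Println(find(idx, "anana")) // [1]
    }
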
Regular expressions are handled by extracting normal strings from the
regular expressions. For example, to search for

    (Path|PathFragment).*=.*/usr/local

we look for

    (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")
and any documents thus found would be searched for the regular
expression.
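
A rough sketch of the extraction step, using Go's regexp/syntax package
to walk the parsed pattern (it ignores alternation, which would turn into
OR atoms, and many other operators; zoekt's actual analysis is more
complete):

    package main

    import (
        "fmt"
        "regexp/syntax"
    )

    // requiredLiterals collects literal strings that every match of the
    // regular expression must contain.
    func requiredLiterals(re *syntax.Regexp) []string {
        switch re.Op {
        case syntax.OpLiteral:
            return []string{string(re.Rune)}
        case syntax.OpConcat, syntax.OpCapture:
            var out []string
            for _, sub := range re.Sub {
                out = append(out, requiredLiterals(sub)...)
            }
            return out
        default:
            return nil // eg. .* constrains nothing
        }
    }

    func main() {
        re, err := syntax.Parse(`Path.*=.*/usr/local`, syntax.Perl)
        if err != nil {
            panic(err)
        }
        fmt.Println(requiredLiterals(re)) // [Path = /usr/local]
    }

Literals shorter than 3 bytes, such as "=" above, cannot be looked up in
a trigram index and would be dropped from the index query.
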
Compared to indexing 3-grams on a per-file basis, as described
[here](https://swtch.com/~rsc/regexp/regexp4.html), there are some advantages:
* for each substring, we only have to intersect a couple of posting lists:
  one for the trigram at its beginning, and one for the trigram at its end.
* Since we touch few posting lists per query, they can be stored on
slower media, such as SSD.
* we can select any pair of trigrams from the pattern for which the
number of matches is minimal. For example, we could search for "qui"
rather than "the".
There are some downsides compared to trigrams:
* The index is large. Empirically, it is about 3x the corpus size: 2x for
  the offsets and 1x for the original content. However, since we have to
  look at just a limited number of ngrams, we don't have to keep the index
  in memory.
Compared to [suffix
arrays](https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/),
there are the following advantages:
* The index construction is straightforward, and can easily be made
incremental.
* Since the posting lists for a trigram can be stored on SSD,
  searching with positional trigrams requires only about 1.2x the corpus
  size in RAM.
* All the matches are returned in document order. This makes it
straightforward to process compound boolean queries with AND and OR.
Downsides compared to suffix array:
* unlike with a suffix array, there is no way to transform regular
  expressions into ranges within the index.
Case sensitivity
----------------
Code is usually searched without regard for case. For a case-insensitive
search for "abc", we look for occurrences of all the different case
variants, ie. {"abc", "Abc", "aBc", "ABc", ...}, and then compare the
candidate matches without regard for case.
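
A sketch of generating the case variants for a trigram (for n=3 this
yields at most 2^3 = 8 posting lists to read; the function is
illustrative and ignores characters with more exotic case behavior):

    package main

    import (
        "fmt"
        "unicode"
    )

    // caseVariants returns every case variant of s, built rune by rune.
    func caseVariants(s string) []string {
        variants := []string{""}
        for _, r := range s {
            lo, up := unicode.ToLower(r), unicode.ToUpper(r)
            var next []string
            for _, v := range variants {
                next = append(next, v+string(lo))
                if up != lo {
                    next = append(next, v+string(up))
                }
            }
            variants = next
        }
        return variants
    }

    func main() {
        fmt.Println(caseVariants("abc"))
        // [abc abC aBc aBC Abc AbC ABc ABC]
    }
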
UTF-8
-----
UTF-8 is the de facto encoding for Unicode. Zoekt assumes that files
are UTF-8 encoded. Characters have differing widths in UTF-8, so we
use rune offsets in the trigram index, and convert those back to byte
offsets with a lookup table: every 100 runes, we store the rune-index to
byte-index mapping. For corpora that are completely ASCII (fairly
normal for source code), we short-circuit this lookup.
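
A sketch of the checkpointing scheme (the 100-rune interval follows the
text; the types are illustrative):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // runeMap stores the byte offset of every 100th rune of content.
    type runeMap struct {
        checkpoints []int // checkpoints[i] = byte offset of rune 100*i
        content     []byte
    }

    func newRuneMap(content []byte) *runeMap {
        m := &runeMap{content: content}
        for byteOff, runeOff := 0, 0; byteOff < len(content); runeOff++ {
            if runeOff%100 == 0 {
                m.checkpoints = append(m.checkpoints, byteOff)
            }
            _, sz := utf8.DecodeRune(content[byteOff:])
            byteOff += sz
        }
        return m
    }

    // byteOffset converts a rune offset to a byte offset: jump to the
    // nearest preceding checkpoint, then decode at most 99 runes forward.
    func (m *runeMap) byteOffset(runeOff int) int {
        b := m.checkpoints[runeOff/100]
        for i := 0; i < runeOff%100; i++ {
            _, sz := utf8.DecodeRune(m.content[b:])
            b += sz
        }
        return b
    }

    func main() {
        m := newRuneMap([]byte("héllo wörld"))
        fmt.Println(m.byteOffset(4)) // 5: "é" occupies two bytes
    }
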
Branches
--------
Each file blob in the index has a bitmask, representing the branches
in which the content is found, eg.

    branches: [master=1, staging=2, stable=4]
    file "x.java", branch mask=3
    file "x.java", branch mask=4

In this case, the index holds two versions of "x.java": the one
present in "master" and "staging", and the one in the "stable" branch.
With this technique, we can index many similar branches of a
repository with little space overhead.
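
A sketch of the bitmask check (branch names and layout are illustrative):

    package main

    import "fmt"

    // One bit per branch, as in the example above.
    const (
        master  = 1 << iota // 1
        staging             // 2
        stable              // 4
    )

    type fileVersion struct {
        name string
        mask uint32 // OR of the branches containing this version
    }

    // onBranch reports whether this version is visible on the given branch.
    func (f fileVersion) onBranch(branch uint32) bool {
        return f.mask&branch != 0
    }

    func main() {
        versions := []fileVersion{
            {"x.java", master | staging}, // mask=3
            {"x.java", stable},           // mask=4
        }
        for _, v := range versions {
            fmt.Println(v.name, v.onBranch(staging)) // true, then false
        }
    }
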
Index format
------------
The index is organized in shards, where each shard is a file, laid out
such that it can be mmap'd efficiently.
Each shard contains data for one code repository. The basic data in an
index shard are the following:
* file contents
* filenames
* the content posting lists (varint encoded; see the sketch after this list)
* the filename posting lists (varint encoded)
* branch masks
* metadata (repository name, index format version, etc.)
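
The exact shard layout is internal to zoekt. As an illustration of
varint-encoded posting lists in general, a common scheme stores the gaps
between successive offsets, so small gaps cost a single byte. A sketch
using Go's encoding/binary (assuming delta encoding, which this document
does not spell out):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // encodePostings writes a posting list as varint-encoded deltas.
    func encodePostings(offsets []uint32) []byte {
        var buf []byte
        var prev uint32
        for _, off := range offsets {
            buf = binary.AppendUvarint(buf, uint64(off-prev))
            prev = off
        }
        return buf
    }

    // decodePostings reverses encodePostings.
    func decodePostings(data []byte) []uint32 {
        var offsets []uint32
        var prev uint32
        for len(data) > 0 {
            delta, n := binary.Uvarint(data)
            data = data[n:]
            prev += uint32(delta)
            offsets = append(offsets, prev)
        }
        return offsets
    }

    func main() {
        enc := encodePostings([]uint32{1, 3, 1000, 1001})
        fmt.Println(len(enc), decodePostings(enc)) // 5 [1 3 1000 1001]
    }
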
In practice, the shard size is about 3x the corpus size.
The format uses uint32 for all offsets, so the total size of a shard
should be below 4G. Given the size of the posting data, this caps
content size per shard at 1G.
Currently, within a shard, a single goroutine searches all documents,
so the shard size determines the amount of parallelism, and large
repositories should be split across multiple shards to achieve good
performance.
The metadata section contains a version number (which by convention is
also part of the file name of the shard). This provides a smooth
upgrade path across format versions: generate shards in the new
format, kill old search service, start new search service, delete old
shards.
Ranking
-------
In the absence of advanced signals (e.g. PageRank on symbol references),
ranking options are limited; the following signals could be used for
ranking:
* number of atoms matched
* closeness to matches for other atoms
* quality of match: does match boundary coincide with a word boundary?
* file latest update time
* filename length
* tokenizer ranking: does a match fall within a comment or string literal?
* symbol ranking: is the match a symbol definition?
For the latter, it is necessary to find symbol definitions and other
sections within files on indexing. Several (imperfect) programs to do
this already exist, eg. `ctags`.
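
For illustration, the signals above could be combined into a single score
with hand-tuned weights; the weights and field names below are made up,
not zoekt's actual ranking:

    package main

    import "fmt"

    type match struct {
        atomsMatched   int
        onWordBoundary bool
        isSymbolDef    bool
        filenameLen    int
    }

    // score combines a few of the signals into one number; higher is better.
    func score(m match) float64 {
        s := 10 * float64(m.atomsMatched)
        if m.onWordBoundary {
            s += 5
        }
        if m.isSymbolDef {
            s += 20
        }
        return s - 0.1*float64(m.filenameLen) // prefer short filenames
    }

    func main() {
        fmt.Println(score(match{atomsMatched: 2, isSymbolDef: true, filenameLen: 12}))
        // 38.8
    }
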
Query language
--------------
Queries are stored as expression trees, using the following data
structure:
    Query:
        Atom
      | AND QueryList
      | OR QueryList
      | NOT Query
      ;

    Atom:
        ConstQuery
      | SubStringQuery
      | RegexpQuery
      | RepoQuery
      | BranchQuery
      ;
Both SubStringQuery and RegexpQuery can apply to either file or
contents, and can optionally be case-insensitive.
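
In Go, the expression tree could look as follows (a sketch; zoekt's
actual query types differ in detail):

    package query

    // Query is the interface implemented by all nodes of the tree.
    type Query interface{ isQuery() }

    type And struct{ Children []Query }
    type Or struct{ Children []Query }
    type Not struct{ Child Query }

    // Const matches everything (true) or nothing (false).
    type Const struct{ Value bool }

    type Substring struct {
        Pattern       string
        FileName      bool // match filenames instead of contents
        CaseSensitive bool
    }

    type Regexp struct {
        Pattern       string
        FileName      bool
        CaseSensitive bool
    }

    type Repo struct{ Name string }
    type Branch struct{ Name string }

    func (And) isQuery()       {}
    func (Or) isQuery()        {}
    func (Not) isQuery()       {}
    func (Const) isQuery()     {}
    func (Substring) isQuery() {}
    func (Regexp) isQuery()    {}
    func (Repo) isQuery()      {}
    func (Branch) isQuery()    {}
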
ConstQuery (match everything, or match nothing) is a useful construct
for partial evaluation of a query: for each index shard through which
we search, we partially evaluate the query, eg. when the query is

    and[substr:"needle" repo:"zoekt"]

then we can rewrite the query to FALSE if we are looking at a shard
for repository "bazel", skipping the entire shard.
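
Using the query types sketched above, partial evaluation could look like
this (only the AND case is shown; OR and NOT fold analogously):

    // simplify partially evaluates q for a shard holding repository
    // repoName, folding repo atoms into constants.
    func simplify(q Query, repoName string) Query {
        switch s := q.(type) {
        case Repo:
            return Const{s.Name == repoName}
        case And:
            var kept []Query
            for _, child := range s.Children {
                child = simplify(child, repoName)
                if c, ok := child.(Const); ok {
                    if !c.Value {
                        return Const{false} // whole AND fails: skip shard
                    }
                    continue // always true: drop the child
                }
                kept = append(kept, child)
            }
            if len(kept) == 0 {
                return Const{true}
            }
            return And{kept}
        default:
            return q
        }
    }

For the example above, simplifying against a shard for repository "bazel"
folds repo:"zoekt" to false, so the whole AND becomes false and the shard
is skipped.
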
Each query must have at least one positive atom. Negations can only
serve to prune results generated by positive atoms.
Query parsing
-------------
Strings in the input language are considered regular expressions, but
regular expressions that are in fact literal strings are simplified to
substring queries:

    a.*b => regexp:"a.*b"
    a\.b => substring:"a.b"

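
This simplification falls out of Go's regexp/syntax package: a pattern
that matches only one fixed string parses to a single literal node (the
atom spelling below is illustrative):

    package main

    import (
        "fmt"
        "regexp/syntax"
    )

    // simplifyPattern returns a substring atom when the pattern matches
    // only one fixed string, and a regexp atom otherwise.
    func simplifyPattern(pattern string) (string, error) {
        re, err := syntax.Parse(pattern, syntax.Perl)
        if err != nil {
            return "", err
        }
        if re.Op == syntax.OpLiteral {
            return fmt.Sprintf("substring:%q", string(re.Rune)), nil
        }
        return fmt.Sprintf("regexp:%q", pattern), nil
    }

    func main() {
        a, _ := simplifyPattern(`a.*b`)
        b, _ := simplifyPattern(`a\.b`)
        fmt.Println(a) // regexp:"a.*b"
        fmt.Println(b) // substring:"a.b"
    }
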
Leading modifiers select different types of atoms, eg.

    file:java => Substring_file:"java"
    branch:master => Branch:"master"

Parentheses inside a single string (i.e., with any spaces escaped) are
interpreted as regular expressions; otherwise they are used for grouping:

    (abc def) => and[substring:"abc" substring:"def"]
    (abc\ def) => regexp:"(abc def)"

There is an implicit "AND" between elements of a parenthesized list.
There is an "OR" operator, which has lower priority than the implicit
"AND":

    ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]]
GERRIT/GITILES INTEGRATION
==========================
Gerrit is a popular system for code review on open source
projects. Its sister project Gitiles provides a browsing experience.
Any code search integration with Gerrit should be made available in
Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch
solution for Gerrit/Gitiles should respect these ACLs.
Since Gitiles knows the identity of the logged-in user, it can
construct search queries that respect ACLs, and even filter results
afterwards if necessary. In such a setup, only Gitiles is allowed to
talk to the search service, so it should be protected from general
access, e.g. by requiring authentication.
A codesearch implementation for Gitiles would change Gitiles to show a
search box on pages relating to a repository. When searching, Gitiles
would also render the search results. The process is as follows:
* On receiving a query, Gitiles finds the list of branches visible to the user.
* Gitiles sends the raw query, along with the branches and the repository,
  to the search service.
* The search service parses the query, and embeds it as follows (see the
  sketch after this list):

      (AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. ))

* The search service returns the search results, leaving it to
  Gitiles to render them. Gitiles can apply any further filtering
  as necessary.
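
A sketch of the embedding step on the search-service side (the
s-expression spelling follows the example above; the function is
illustrative):

    package main

    import (
        "fmt"
        "strings"
    )

    // wrapQuery restricts rawQuery to one repository and to the branches
    // visible to the requesting user.
    func wrapQuery(rawQuery, repo string, visibleBranches []string) string {
        atoms := make([]string, len(visibleBranches))
        for i, b := range visibleBranches {
            atoms[i] = fmt.Sprintf("branch:%q", b)
        }
        return fmt.Sprintf("(AND %s repo:%s (OR %s))",
            rawQuery, repo, strings.Join(atoms, " "))
    }

    func main() {
        fmt.Println(wrapQuery("needle", "zoekt", []string{"master", "stable"}))
        // (AND needle repo:zoekt (OR branch:"master" branch:"stable"))
    }
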
SERVICE MANAGEMENT
==================
The above details how indexing and searching works. A fully fledged
system also crawls repositories and (re)indexes them. Since the system
is designed to run on a single machine, we provide a service
management tool, with the following responsibilities:
* Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates
* Reindex any changed repositories
* Run the webserver, and restart it if it goes down for any reason
* Delete old webserver logs
Security
--------
This section assumes that 'zoekt' is used as a public facing
webserver, indexing publicly available data, serving on HTTPS without
authentication.
Since the UI is unauthenticated, there are no authentication secrets to steal.
Since the code is public, there is no sensitive code to steal.
This leaves us with the following sensitive data:
* Credentials for accessing git repositories (eg. a GitHub access token)
* TLS server certificates
* Query logs
The system handles the following untrusted data:
* code in git repositories
* search queries
Since 'zoekt' itself is written in Go, it does not have memory
security problems: at worst, a bug in the query parser would lead to a
crash.
The code to be indexed is handled by `ctags` for symbol detection. The
security risk this poses is mitigated by seccomp-based sandboxing.
Privacy
-------
Webserver logs can contain privacy sensitive data (such as IP
addresses and search queries). For this reason, the service management
tool deletes them after a configurable period of time.