OBJECTIVE
=========
Provide full-text code search for git-based corpora.
Goals:
* sub-50ms results on large codebases, such as Android (~2G text) or
Chrome
* works well on a single standard Linux machine, with stable storage on SSD
* search multiple repositories and multiple branches.
* provide rich query language, with boolean operators
* integrate with Gerrit/Gitiles code review/browsing system
SEARCHING AND INDEXING
======================
Positional trigrams
-------------------
We build an index of ngrams (n=3), where we store the offset of each
ngram's occurrence within a file. For example, if the corpus is "banana",
then we generate the index

    "ban": 0
    "ana": 1,3
    "nan": 2
If we are searching for a string (eg. "The quick brown fox"), then we
look for two trigrams (eg. "The" and "fox"), and check that they are
found at the right distance apart.
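
A minimal sketch of this scheme in Go (using byte offsets and ASCII-only
content; all names are illustrative, not zoekt's actual data structures):

    package main

    import "fmt"

    // index maps each trigram to the sorted offsets at which it occurs.
    func index(content string) map[string][]int {
        idx := map[string][]int{}
        for i := 0; i+3 <= len(content); i++ {
            idx[content[i:i+3]] = append(idx[content[i:i+3]], i)
        }
        return idx
    }

    // find returns candidate offsets for pattern: positions where the first
    // and last trigram of the pattern occur the right distance apart. The
    // candidates must still be verified against the actual content.
    func find(idx map[string][]int, pattern string) []int {
        first, last := pattern[:3], pattern[len(pattern)-3:]
        dist := len(pattern) - 3
        lastOffsets := map[int]bool{}
        for _, off := range idx[last] {
            lastOffsets[off] = true
        }
        var hits []int
        for _, off := range idx[first] {
            if lastOffsets[off+dist] {
                hits = append(hits, off)
            }
        }
        return hits
    }

    func main() {
        idx := index("banana")
        fmt.Println(idx["ana"])         // [1 3]
        fmt.Println(find(idx, "anana")) // [1]
    }
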
Regular expressions are handled by extracting normal strings from the
regular expressions. For example, to search for

    (Path|PathFragment).*=.*/usr/local

we look for

    (AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")
and any documents thus found would be searched for the regular
expression.
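
A rough sketch of the extraction step, using Go's regexp/syntax package
to walk the parsed pattern (it ignores alternation, which would turn into
OR atoms, and many other operators; zoekt's actual analysis is more
complete):

    package main

    import (
        "fmt"
        "regexp/syntax"
    )

    // requiredLiterals collects literal strings that every match of the
    // regular expression must contain.
    func requiredLiterals(re *syntax.Regexp) []string {
        switch re.Op {
        case syntax.OpLiteral:
            return []string{string(re.Rune)}
        case syntax.OpConcat, syntax.OpCapture:
            var out []string
            for _, sub := range re.Sub {
                out = append(out, requiredLiterals(sub)...)
            }
            return out
        default:
            return nil // eg. .* constrains nothing
        }
    }

    func main() {
        re, err := syntax.Parse(`Path.*=.*/usr/local`, syntax.Perl)
        if err != nil {
            panic(err)
        }
        fmt.Println(requiredLiterals(re)) // [Path = /usr/local]
    }

Literals shorter than 3 bytes, such as "=" above, cannot be looked up in
a trigram index and would be dropped from the index query.
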
Compared to indexing 3-grams on a per-file basis, as described
[here](https://swtch.com/~rsc/regexp/regexp4.html), there are some advantages:
* for each substring, we only have to intersect a couple of posting lists:
  one for the trigram at its beginning, and one for the trigram at its end.
* Since we touch few posting lists per query, they can be stored on
slower media, such as SSD.
* we can select any pair of trigrams from the pattern for which the
number of matches is minimal. For example, we could search for "qui"
rather than "the".
There are some downsides compared to trigrams:
* The index is large. Empirically, it is about 3x the corpus size: 2x for
  the offsets and 1x for the original content. However, since we have to
  look at just a limited number of ngrams, we don't have to keep the index
  in memory.
Compared to [suffix
arrays](https://blog.nelhage.com/2015/02/regular-expression-search-with-suffix-arrays/),
there are the following advantages:
* The index construction is straightforward, and can easily be made
incremental.
* Since the posting lists for a trigram can be stored on SSD,
  searching with positional trigrams requires only about 1.2x the corpus
  size in RAM.
* All the matches are returned in document order. This makes it
straightforward to process compound boolean queries with AND and OR.
Downsides compared to suffix array:
* unlike with a suffix array, there is no way to transform regular
  expressions into ranges within the index.
Case sensitivity
----------------
Code is usually searched without regard for case. For a case-insensitive
search for "abc", we look for occurrences of all the different case
variants, ie. {"abc", "Abc", "aBc", "ABc", ...}, and then compare the
candidate matches without regard for case.
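
A sketch of generating the case variants for a trigram (for n=3 this
yields at most 2^3 = 8 posting lists to read; the function is
illustrative and ignores characters with more exotic case behavior):

    package main

    import (
        "fmt"
        "unicode"
    )

    // caseVariants returns every case variant of s, built rune by rune.
    func caseVariants(s string) []string {
        variants := []string{""}
        for _, r := range s {
            lo, up := unicode.ToLower(r), unicode.ToUpper(r)
            var next []string
            for _, v := range variants {
                next = append(next, v+string(lo))
                if up != lo {
                    next = append(next, v+string(up))
                }
            }
            variants = next
        }
        return variants
    }

    func main() {
        fmt.Println(caseVariants("abc"))
        // [abc abC aBc aBC Abc AbC ABc ABC]
    }
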
UTF-8
-----
UTF-8 is the de facto encoding for Unicode. Zoekt assumes that files
are UTF-8 encoded. Characters have differing widths in UTF-8, so we
use rune offsets in the trigram index, and convert those back to byte
offsets with a lookup table: every 100 runes, we store the rune-index to
byte-index mapping. For corpora that are completely ASCII (fairly
normal for source code), we short-circuit this lookup.
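
A sketch of the checkpointing scheme (the 100-rune interval follows the
text; the types are illustrative):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    // runeMap stores the byte offset of every 100th rune of content.
    type runeMap struct {
        checkpoints []int // checkpoints[i] = byte offset of rune 100*i
        content     []byte
    }

    func newRuneMap(content []byte) *runeMap {
        m := &runeMap{content: content}
        for byteOff, runeOff := 0, 0; byteOff < len(content); runeOff++ {
            if runeOff%100 == 0 {
                m.checkpoints = append(m.checkpoints, byteOff)
            }
            _, sz := utf8.DecodeRune(content[byteOff:])
            byteOff += sz
        }
        return m
    }

    // byteOffset converts a rune offset to a byte offset: jump to the
    // nearest preceding checkpoint, then decode at most 99 runes forward.
    func (m *runeMap) byteOffset(runeOff int) int {
        b := m.checkpoints[runeOff/100]
        for i := 0; i < runeOff%100; i++ {
            _, sz := utf8.DecodeRune(m.content[b:])
            b += sz
        }
        return b
    }

    func main() {
        m := newRuneMap([]byte("héllo wörld"))
        fmt.Println(m.byteOffset(4)) // 5: "é" occupies two bytes
    }
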
Branches
--------
Each file blob in the index has a bitmask, representing the branches
in which the content is found, eg.

    branches: [master=1, staging=2, stable=4]
    file "x.java", branch mask=3
    file "x.java", branch mask=4

In this case, the index holds two versions of "x.java": the one
present in "master" and "staging", and the one in the "stable" branch.
With this technique, we can index many similar branches of a
repository with little space overhead.
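
A sketch of the bitmask check (branch names and layout are illustrative):

    package main

    import "fmt"

    // One bit per branch, as in the example above.
    const (
        master  = 1 << iota // 1
        staging             // 2
        stable              // 4
    )

    type fileVersion struct {
        name string
        mask uint32 // OR of the branches containing this version
    }

    // onBranch reports whether this version is visible on the given branch.
    func (f fileVersion) onBranch(branch uint32) bool {
        return f.mask&branch != 0
    }

    func main() {
        versions := []fileVersion{
            {"x.java", master | staging}, // mask=3
            {"x.java", stable},           // mask=4
        }
        for _, v := range versions {
            fmt.Println(v.name, v.onBranch(staging)) // true, then false
        }
    }
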
Index format
------------
The index is organized in shards, where each shard is a file, laid out
such that it can be mmap'd efficiently.
Each shard contains data for one code repository. The basic data in an
index shard are the following:
* file contents
* filenames
* the content posting lists (varint encoded; see the sketch after this list)
* the filename posting lists (varint encoded)
* branch masks
* metadata (repository name, index format version, etc.)
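
The exact shard layout is internal to zoekt. As an illustration of
varint-encoded posting lists in general, a common scheme stores the gaps
between successive offsets, so small gaps cost a single byte. A sketch
using Go's encoding/binary (assuming delta encoding, which this document
does not spell out):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // encodePostings writes a posting list as varint-encoded deltas.
    func encodePostings(offsets []uint32) []byte {
        var buf []byte
        var prev uint32
        for _, off := range offsets {
            buf = binary.AppendUvarint(buf, uint64(off-prev))
            prev = off
        }
        return buf
    }

    // decodePostings reverses encodePostings.
    func decodePostings(data []byte) []uint32 {
        var offsets []uint32
        var prev uint32
        for len(data) > 0 {
            delta, n := binary.Uvarint(data)
            data = data[n:]
            prev += uint32(delta)
            offsets = append(offsets, prev)
        }
        return offsets
    }

    func main() {
        enc := encodePostings([]uint32{1, 3, 1000, 1001})
        fmt.Println(len(enc), decodePostings(enc)) // 5 [1 3 1000 1001]
    }
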
In practice, the shard size is about 3x the corpus size.
The format uses uint32 for all offsets, so the total size of a shard
should be below 4G. Given the size of the posting data, this caps
content size per shard at 1G.
Currently, within a shard, a single goroutine searches all documents,
so the shard size determines the amount of parallelism, and large
repositories should be split across multiple shards to achieve good
performance.
The metadata section contains a version number (which by convention is
also part of the file name of the shard). This provides a smooth
upgrade path across format versions: generate shards in the new
format, kill old search service, start new search service, delete old
shards.
Ranking
-------
In the absence of advanced signals (e.g. PageRank on symbol references),
ranking options are limited; the following signals could be used for
ranking:
* number of atoms matched
* closeness to matches for other atoms
* quality of match: does match boundary coincide with a word boundary?
* file latest update time
* filename length
* tokenizer ranking: does a match fall within a comment or string literal?
* symbol ranking: is the match a symbol definition?
For the latter, it is necessary to find symbol definitions and other
sections within files on indexing. Several (imperfect) programs to do
this already exist, eg. `ctags`.
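
For illustration, the signals above could be combined into a single score
with hand-tuned weights; the weights and field names below are made up,
not zoekt's actual ranking:

    package main

    import "fmt"

    type match struct {
        atomsMatched   int
        onWordBoundary bool
        isSymbolDef    bool
        filenameLen    int
    }

    // score combines a few of the signals into one number; higher is better.
    func score(m match) float64 {
        s := 10 * float64(m.atomsMatched)
        if m.onWordBoundary {
            s += 5
        }
        if m.isSymbolDef {
            s += 20
        }
        return s - 0.1*float64(m.filenameLen) // prefer short filenames
    }

    func main() {
        fmt.Println(score(match{atomsMatched: 2, isSymbolDef: true, filenameLen: 12}))
        // 38.8
    }
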
Query language
--------------
Queries are stored as expression trees, using the following data
structure:
    Query:
        Atom
      | AND QueryList
      | OR QueryList
      | NOT Query
      ;

    Atom:
        ConstQuery
      | SubStringQuery
      | RegexpQuery
      | RepoQuery
      | BranchQuery
      ;
Both SubStringQuery and RegexpQuery can apply to either file or
contents, and can optionally be case-insensitive.
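
In Go, the expression tree could look as follows (a sketch; zoekt's
actual query types differ in detail):

    package query

    // Query is the interface implemented by all nodes of the tree.
    type Query interface{ isQuery() }

    type And struct{ Children []Query }
    type Or struct{ Children []Query }
    type Not struct{ Child Query }

    // Const matches everything (true) or nothing (false).
    type Const struct{ Value bool }

    type Substring struct {
        Pattern       string
        FileName      bool // match filenames instead of contents
        CaseSensitive bool
    }

    type Regexp struct {
        Pattern       string
        FileName      bool
        CaseSensitive bool
    }

    type Repo struct{ Name string }
    type Branch struct{ Name string }

    func (And) isQuery()       {}
    func (Or) isQuery()        {}
    func (Not) isQuery()       {}
    func (Const) isQuery()     {}
    func (Substring) isQuery() {}
    func (Regexp) isQuery()    {}
    func (Repo) isQuery()      {}
    func (Branch) isQuery()    {}
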
ConstQuery (match everything, or match nothing) is a useful construct
for partial evaluation of a query: for each index shard through which
we search, we partially evaluate the query, eg. when the query is

    and[substr:"needle" repo:"zoekt"]

then we can rewrite the query to FALSE if we are looking at a shard
for repository "bazel", skipping the entire shard.
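
Using the query types sketched above, partial evaluation could look like
this (only the AND case is shown; OR and NOT fold analogously):

    // simplify partially evaluates q for a shard holding repository
    // repoName, folding repo atoms into constants.
    func simplify(q Query, repoName string) Query {
        switch s := q.(type) {
        case Repo:
            return Const{s.Name == repoName}
        case And:
            var kept []Query
            for _, child := range s.Children {
                child = simplify(child, repoName)
                if c, ok := child.(Const); ok {
                    if !c.Value {
                        return Const{false} // whole AND fails: skip shard
                    }
                    continue // always true: drop the child
                }
                kept = append(kept, child)
            }
            if len(kept) == 0 {
                return Const{true}
            }
            return And{kept}
        default:
            return q
        }
    }

For the example above, simplifying against a shard for repository "bazel"
folds repo:"zoekt" to false, so the whole AND becomes false and the shard
is skipped.
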
Each query must have at least one positive atom. Negations can only
serve to prune results generated by positive atoms.
Query parsing
-------------
Strings in the input language are considered regular expressions, but
regular expressions that are in fact literal strings are simplified to
substring queries:

    a.*b => regexp:"a.*b"
    a\.b => substring:"a.b"

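
This simplification falls out of Go's regexp/syntax package: a pattern
that matches only one fixed string parses to a single literal node (the
atom spelling below is illustrative):

    package main

    import (
        "fmt"
        "regexp/syntax"
    )

    // simplifyPattern returns a substring atom when the pattern matches
    // only one fixed string, and a regexp atom otherwise.
    func simplifyPattern(pattern string) (string, error) {
        re, err := syntax.Parse(pattern, syntax.Perl)
        if err != nil {
            return "", err
        }
        if re.Op == syntax.OpLiteral {
            return fmt.Sprintf("substring:%q", string(re.Rune)), nil
        }
        return fmt.Sprintf("regexp:%q", pattern), nil
    }

    func main() {
        a, _ := simplifyPattern(`a.*b`)
        b, _ := simplifyPattern(`a\.b`)
        fmt.Println(a) // regexp:"a.*b"
        fmt.Println(b) // substring:"a.b"
    }
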
Leading modifiers select different types of atoms, eg.

    file:java => Substring_file:"java"
    branch:master => Branch:"master"

Parentheses inside a single string (i.e., with any spaces escaped) are
interpreted as regular expressions; otherwise they are used for grouping:

    (abc def) => and[substring:"abc" substring:"def"]
    (abc\ def) => regexp:"(abc def)"

There is an implicit "AND" between elements of a parenthesized list.
There is an "OR" operator, which has lower priority than the implicit
"AND":

    ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]]
GERRIT/GITILES INTEGRATION
==========================
Gerrit is a popular system for code review on open source
projects. Its sister project Gitiles provides a browsing experience.
Any code search integration with Gerrit should be made available in
Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch
solution for Gerrit/Gitiles should respect these ACLs.
Since Gitiles knows the identity of the logged-in user, it can
construct search queries that respect ACLs, and even filter results
afterwards if necessary. In such a setup, only Gitiles is allowed to
talk to the search service, so it should be protected from general
access, e.g. by requiring authentication.
A codesearch implementation for Gitiles would change Gitiles to show a
search box on pages relating to a repository. When searching, Gitiles
would also render the search results. The process is as follows:
* On receiving a query, Gitiles finds the list of branches visible to the user.
* Gitiles sends the raw query, along with the branches and the repository,
  to the search service.
* The search service parses the query, and embeds it as follows (see the
  sketch after this list):

      (AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. ))

* The search service returns the search results, leaving it to
  Gitiles to render them. Gitiles can apply any further filtering
  as necessary.
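
A sketch of the embedding step on the search-service side (the
s-expression spelling follows the example above; the function is
illustrative):

    package main

    import (
        "fmt"
        "strings"
    )

    // wrapQuery restricts rawQuery to one repository and to the branches
    // visible to the requesting user.
    func wrapQuery(rawQuery, repo string, visibleBranches []string) string {
        atoms := make([]string, len(visibleBranches))
        for i, b := range visibleBranches {
            atoms[i] = fmt.Sprintf("branch:%q", b)
        }
        return fmt.Sprintf("(AND %s repo:%s (OR %s))",
            rawQuery, repo, strings.Join(atoms, " "))
    }

    func main() {
        fmt.Println(wrapQuery("needle", "zoekt", []string{"master", "stable"}))
        // (AND needle repo:zoekt (OR branch:"master" branch:"stable"))
    }
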
SERVICE MANAGEMENT
==================
The above details how indexing and searching works. A fully fledged
system also crawls repositories and (re)indexes them. Since the system
is designed to run on a single machine, we provide a service
management tool, with the following responsibilities:
* Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates
* Reindex any changed repositories
* Run the webserver, and restart it if it goes down for any reason
* Delete old webserver logs
Security
--------
This section assumes that 'zoekt' is used as a public facing
webserver, indexing publicly available data, serving on HTTPS without
authentication.
Since the UI is unauthenticated, there are no authentication secrets to steal.
Since the code is public, there is no sensitive code to steal.
This leaves us with the following sensitive data:
* Credentials for accessing git repositories (eg. a GitHub access token)
* TLS server certificates
* Query logs
The system handles the following untrusted data:
* code in git repositories
* search queries
Since 'zoekt' itself is written in Go, it does not have memory
security problems: at worst, a bug in the query parser would lead to a
crash.
The code to be indexed is handled by `ctags` for symbol detection. The
security risk this poses is mitigated by seccomp-based sandboxing.
Privacy
-------
Webserver logs can contain privacy sensitive data (such as IP
addresses and search queries). For this reason, the service management
tool deletes them after a configurable period of time.