Provide full-text code search for git-based corpora.
Goals:
- sub-50ms results on large codebases, such as Android (~2G of text) or Chrome
- works well on a single standard Linux machine, with stable storage on SSD
- search across multiple repositories and multiple branches
- provide a rich query language, with boolean operators
- integrate with the Gerrit/Gitiles code review and browsing system
We build an index of ngrams (n=3), where we store the offset of each ngram's occurrence within a file. For example, if the corpus is “banana”, then we generate the index:

"ban": 0
"ana": 1, 3
"nan": 2
If we are searching for a string (e.g. “The quick brown fox”), then we look for two of its trigrams (e.g. “The” and “fox”) and check that they are found at the right distance apart.
Regular expressions are handled by extracting normal strings from the regular expressions. For example, to search for
(Path|PathFragment).*=.*/usr/local
we look for
(AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")
and any documents thus found would be searched for the regular expression.
Compared to indexing 3-grams on a per-file basis, as described here, this scheme has some advantages:
- For each substring, we only have to intersect a couple of posting lists: one for the beginning and one for the end.
- Since we touch few posting lists per query, they can be stored on slower media, such as SSD.
- We can select the pair of trigrams from the pattern for which the number of matches is minimal. For example, we could search for “qui” rather than “the”.
There are some downsides compared to per-file trigram indexing:
Compared to suffix arrays, there are the following advantages:
- The index construction is straightforward, and can easily be made incremental.
- Since the posting lists for a trigram can be stored on SSD, searching with positional trigrams requires only 1.2x the corpus size in RAM.
- All matches are returned in document order. This makes it straightforward to process compound boolean queries with AND and OR.
Downsides compared to suffix arrays:
Code is usually searched without regard for case. In that case, when we are looking for “abc”, we look for occurrences of all the different case variants, i.e. {“abc”, “Abc”, “aBc”, “ABc”, ...}, and then compare the candidate matches without regard for case.
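Enumerating the case variants of a trigram can be sketched as follows; the function name is illustrative:

```go
package main

import (
	"fmt"
	"unicode"
)

// caseVariants expands a string into all of its case variants, e.g.
// "abc" yields 8 strings. Non-letters contribute a single variant.
func caseVariants(s string) []string {
	variants := []string{""}
	for _, r := range s {
		lo, up := unicode.ToLower(r), unicode.ToUpper(r)
		var next []string
		for _, v := range variants {
			next = append(next, v+string(lo))
			if up != lo { // letters have two forms; digits etc. have one
				next = append(next, v+string(up))
			}
		}
		variants = next
	}
	return variants
}

func main() {
	fmt.Println(caseVariants("abc")) // 8 variants, from "abc" to "ABC"
}
```

A case-insensitive search then takes the union of the posting lists of all variants before checking the candidates.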
UTF-8 is the de facto encoding for Unicode. Zoekt assumes that files are UTF-8 encoded. Characters have differing byte widths in UTF-8, so we use rune offsets in the trigram index, and convert those back to byte offsets with a lookup table: every 100 runes, we store the rune-index to byte-index mapping. For corpora that are completely ASCII (fairly normal for source code), we short-circuit this lookup.
Each file blob in the index has a bitmask representing the branches in which its content is found, e.g.:

branches: [master=1, staging=2, stable=4]
file "x.java", branch mask=3
file "x.java", branch mask=4
In this case, the index holds two versions of “x.java”: the one present in both “master” and “staging”, and the one in the “stable” branch.
With this technique, we can index many similar branches of a repository with little space overhead.
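The branch mask check can be sketched as follows (names are illustrative):

```go
package main

import "fmt"

// branchBits assigns each branch of a repository one bit; each indexed
// blob stores the OR of the bits for the branches that contain it.
var branchBits = map[string]uint32{"master": 1, "staging": 2, "stable": 4}

type fileBlob struct {
	name string
	mask uint32
}

// onBranch reports whether the blob's content is present on the branch.
func (f fileBlob) onBranch(branch string) bool {
	return f.mask&branchBits[branch] != 0
}

func main() {
	blobs := []fileBlob{{"x.java", 3}, {"x.java", 4}}
	for _, b := range blobs {
		fmt.Println(b.name, b.onBranch("master"), b.onBranch("stable"))
	}
}
```

A branch-restricted query is then just a bitwise AND against each candidate blob's mask, so unchanged files are stored once no matter how many branches share them.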
The index is organized in shards, where each shard is a file, laid out such that it can be mmap'd efficiently.
Each shard contains data for one code repository. The basic data in an index shard are the following:
In practice, the shard size is about 3x the corpus size.
The format uses uint32 for all offsets, so the total size of a shard should be below 4G. Given the size of the posting data, this caps content size per shard at 1G.
Currently, a single goroutine searches all documents within a shard, so the shard size determines the amount of parallelism; large repositories should be split across multiple shards to achieve good performance.
The metadata section contains a version number (which by convention is also part of the file name of the shard). This provides a smooth upgrade path across format versions: generate shards in the new format, kill old search service, start new search service, delete old shards.
In the absence of advanced signals (e.g. PageRank on symbol references), ranking options are limited. The following signals could be used for ranking:
For the latter, it is necessary to find symbol definitions and other sections within files at indexing time. Several (imperfect) programs to do this already exist, e.g. ctags.
Queries are stored as expression trees, using the following data structure:
Query: Atom
     | AND QueryList
     | OR QueryList
     | NOT Query ;

Atom: ConstQuery | SubStringQuery | RegexpQuery | RepoQuery | BranchQuery ;
Both SubStringQuery and RegexpQuery can apply to either file or contents, and can optionally be case-insensitive.
ConstQuery (match everything, or match nothing) is a useful construct for partial evaluation of a query: for each index shard through which we search, we partially evaluate the query, eg. when the query is
and[substr:"needle" repo:"zoekt"]
then we can rewrite the query to FALSE if we are looking at a shard for repository “bazel”, skipping the entire shard.
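This partial evaluation can be sketched with a minimal expression tree; the type and field names are illustrative, not Zoekt's actual API:

```go
package main

import "fmt"

type Query interface{}

type Const struct{ Value bool }        // match everything / match nothing
type Substring struct{ Pattern string }
type Repo struct{ Name string }
type And struct{ Children []Query }

// simplify partially evaluates q for a shard holding repository repo:
// Repo atoms become constants, and an AND containing FALSE collapses,
// letting us skip the whole shard.
func simplify(q Query, repo string) Query {
	switch s := q.(type) {
	case Repo:
		return Const{s.Name == repo}
	case And:
		var kept []Query
		for _, c := range s.Children {
			c = simplify(c, repo)
			if k, ok := c.(Const); ok {
				if !k.Value {
					return Const{false} // whole AND is FALSE
				}
				continue // TRUE is a no-op inside AND
			}
			kept = append(kept, c)
		}
		if len(kept) == 0 {
			return Const{true}
		}
		return And{kept}
	default:
		return q
	}
}

func main() {
	q := And{[]Query{Substring{"needle"}, Repo{"zoekt"}}}
	fmt.Printf("%+v\n", simplify(q, "bazel")) // collapses to a FALSE constant
	fmt.Printf("%+v\n", simplify(q, "zoekt")) // only the substring atom remains
}
```

OR and NOT nodes would be handled analogously (OR collapses on TRUE and drops FALSE children).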
Each query must have at least one positive atom. Negations can only serve to prune results generated by positive atoms.
Strings in the input language are interpreted as regular expressions, but literal regular expressions are simplified to substring queries:

a.*b => regexp:"a.*b"
a\.b => substring:"a.b"
Leading modifiers select different types of atoms, e.g.:

file:java => Substring_file:"java"
branch:master => Branch:"master"
Parentheses inside a string (possibly with escaped spaces) are interpreted as regular expressions; otherwise, they are used for grouping:

(abc def) => and[substring:"abc" substring:"def"]
(abc\ def) => regexp:"(abc def)"
There is an implicit “AND” between the elements of a parenthesized list, and an “OR” operator with lower priority than the implicit “AND”:

ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]]
Gerrit is a popular system for code review on open source projects. Its sister project Gitiles provides a browsing experience.
Any code search integration with Gerrit should be made available in Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch solution for Gerrit/Gitiles should respect these ACLs.
Since Gitiles knows the identity of the logged-in user, it can construct search queries that respect ACLs, and even filter results afterwards if necessary. In such a setup, only Gitiles is allowed to talk to the search service, so it should be protected from general access, e.g. by requiring authentication.
A codesearch implementation for Gitiles would change Gitiles to show a search box on pages relating to a repository. When searching, Gitiles would also render the search results. The process is as follows:
(AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. ))
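Constructing such an ACL-restricted query string can be sketched as follows; aclQuery is a hypothetical helper, and the exact syntax sent to the search service is assumed to follow the query language described above:

```go
package main

import (
	"fmt"
	"strings"
)

// aclQuery wraps a user query so that results are restricted to one
// repository and to the branches the logged-in user may see.
func aclQuery(original, repo string, visibleBranches []string) string {
	var branches []string
	for _, br := range visibleBranches {
		branches = append(branches, "branch:"+br)
	}
	return fmt.Sprintf("(%s repo:%s (%s))",
		original, repo, strings.Join(branches, " or "))
}

func main() {
	fmt.Println(aclQuery("needle", "zoekt", []string{"master", "stable"}))
	// (needle repo:zoekt (branch:master or branch:stable))
}
```

Because the search service only accepts queries from Gitiles, the branch restriction cannot be bypassed by end users.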
The above details how indexing and searching works. A fully fledged system also crawls repositories and (re)indexes them. Since the system is designed to run on a single machine, we provide a service management tool, with the following responsibilities:
This section assumes that ‘zoekt’ is used as a public-facing webserver, indexing publicly available data and serving over HTTPS without authentication.
Since the UI is unauthenticated, there are no authentication secrets to steal.
Since the code is public, there is no sensitive code to steal.
This leaves us with the following sensitive data:
The system handles the following untrusted data:
Since ‘zoekt’ itself is written in Go, it does not have memory safety problems: at worst, a bug in the query parser would lead to a crash.
The code to index is handled by ctags for symbol detection. The security risk this poses is mitigated by seccomp-based sandboxing.
Webserver logs can contain privacy sensitive data (such as IP addresses and search queries). For this reason, the service management tool deletes them after a configurable period of time.