OBJECTIVE

Provide full text code search for git based corpuses.

Goals:

sub-50ms results on large codebases, such as Android (~2G text) or Chrome
works well on a single standard Linux machine, with stable storage on SSD
search multiple repositories and multiple branches.
provide rich query language, with boolean operators
integrate with Gerrit/Gitiles code review/browsing system

SEARCHING AND INDEXING

Positional trigrams

We build an index of ngrams (n=3), where we store the offset of each ngram's occurrence within a file. For example, if the corpus is “banana” then we generate the index

"ban": 0
"ana": 1,3
"nan": 2

If we are searching for a string (eg. “The quick brown fox”), then we look for two trigrams (eg. “The” and “fox”), and check that they are found at the right distance apart.

Regular expressions are handled by extracting normal strings from the regular expressions. For example, to search for

(Path|PathFragment).*=.*/usr/local

we look for

(AND (OR substr:"Path" substr:"PathFragment") substr:"/usr/local")

and any documents thus found would be searched for the regular expression.

Compared to indexing 3-grams on a per-file basis, as described here, there are some advantages:

for each substring, we only have to intersect just a couple of posting-lists: one for the beginning, and one for the end.
Since we touch few posting lists per query, they can be stored on slower media, such as SSD.
we can select any pair of trigrams from the pattern for which the number of matches is minimal. For example, we could search for “qui” rather than “the”.

There are some downsides compared to trigrams:

The index is large. Empirically, it is about 3x the corpus size, composed of 2x (offsets), and 1x (original content). However, since we have to look at just a limited number of ngrams, we don't have to keep the index in memory.

Compared to suffix arrays, there are the following advantages:

The index construction is straightforward, and can easily be made incremental.
Since the posting lists for a trigram can be stored on SSD, searching with positional trigrams only requires 1.2x corpus size of RAM.
All the matches are returned in document order. This makes it straightforward to process compound boolean queries with AND and OR.

Downsides compared to suffix array:

there is no way to transform regular expressions into index ranges into the suffix array.

Case sensitivity

Code usually is searched without regard for case. In this case, when we are looking for “abc”, we look for occurrences of all the different case variants, ie. {“abc”, “Abc”, “aBc”, “ABc”, ... }, and then compare the candidate matches without regard for case.

UTF-8

UTF-8 is the defacto encoding for unicode. Zoekt assumes that files are UTF-8 encoded. Characters have differing widths in UTF-8, so we use rune offsets in the trigram index, and convert those back to bytes with a lookup table: every 100 runes, we store the rune-index to byte-index mapping. For corpuses that are completely ASCII (fairly normal for source code), we short-circuit this lookup.

Branches

Each file blob in the index has a bitmask, representing the branches in which the content is found, eg:

branches: [master=1, staging=2, stable=4]
file "x.java", branch mask=3
file "x.java", branch mask=4

in this case, the index holds two versions of “x.java”, the one present in “master” and “staging”, and the one in the “stable” branch.

With this technique, we can index many similar branches of a repository with little space overhead.

Index format

The index is organized in shards, where each shard is a file, laid out such that it can be mmap'd efficiently.

Each shard contains data for one code repository. The basic data in an index shard are the following

file contents
filenames
the content posting lists (varint encoded)
the filename posting lists (varint encoded)
branch masks
metadata (repository name, index format version, etc.)

In practice, the shard size is about 3x the corpus (size).

The format uses uint32 for all offsets, so the total size of a shard should be below 4G. Given the size of the posting data, this caps content size per shard at 1G.

Currently, within a shard, a single goroutine searches all documents, so the shard size determines the amount of parallelism, and large repositories should be split across multiple shards to achieve good performance.

The metadata section contains a version number (which by convention is also part of the file name of the shard). This provides a smooth upgrade path across format versions: generate shards in the new format, kill old search service, start new search service, delete old shards.

Ranking

In absense of advanced signals (e.g. pagerank on symbol references), ranking options are limited: the following signals could be used for ranking

number of atoms matched
closeness to matches for other atoms
quality of match: does match boundary coincide with a word boundary?
file latest update time
filename lengh
tokenizer ranking: does a match fall comment or string literal?
symbol ranking: it the match a symbol definition?

For the latter, it is necessary to find symbol definitions and other sections within files on indexing. Several (imperfect) programs to do this already exist, eg. ctags.

Query language

Queries are stored as expression trees, using the following data structure:

Query:
    Atom
    | AND QueryList
    | OR QueryList
    | NOT Query
    ;

Atom:
    ConstQuery
    | SubStringQuery
    | RegexpQuery
    | RepoQuery
    | BranchQuery
    ;

Both SubStringQuery and RegexpQuery can apply to either file or contents, and can optionally be case-insensitive.

ConstQuery (match everything, or match nothing) is a useful construct for partial evaluation of a query: for each index shard through which we search, we partially evaluate the query, eg. when the query is

and[substr:"needle" repo:"zoekt"]

then we can rewrite the query to FALSE if we are looking at a shard for repository “bazel”, skipping the entire shard.

Each query must have at least one positive atom. Negations can only serve to prune results generated by positive atoms.

Query parsing

Strings in the input language are considered regular expressions but literal regular expressions are simplified to Substring queries,

a.*b => regexp:"a.*b"
a\.b => substring:"a.b"

leading modifiers select different types of atoms, eg.

file:java => Substring_file:"java"
branch:master => Repo:"master"

parentheses inside a string (possibly with escaped spaces) are interpreted as regular expressions, otherwise they are used for grouping

(abc def) => and[substring:"abc" substring:"def"]
(abc\ def) => regexp:"(abc def)"

there is an implicit “AND” between elements of a parenthesized list. There is an “OR” operator, which has lower priority than the implicit “AND”:

ppp qqq or rrr sss => or[and[substring:"ppp" substring:"qqq"] and[substring:"rrr" substring:"sss"]]

GERRIT/GITILES INTEGRATION

Gerrit is a popular system for code review on open source projects. Its sister project Gitiles provides a browsing experience.

Any code search integration with Gerrit should be made available in Gitiles. Gerrit/Gitiles has a complex ACL system, so a codesearch solution for Gerrit/Gitiles should respect these ACLs.

Since Gitiles knows the identity of the logged-in user, it can construct search queries that respect ACLs, and even filter results afterwards if necessary. In such a setup, only Gitiles is allowed to talk to the search service, so it should be protected from general access, e.g. by requiring authentication.

A codesearch implementation for Gitiles would change Gitiles to show a search box on pages relating to a repository. When searching, Gitiles would also render the search results. The process is as follows:

On receiving a query, Gitiles finds the list of branches visible to the user
Gitiles sends the raw query, along with branches and repository to the search service
The search service parses the query, and embeds it as follows

(AND original-query repo:REPO (OR "branch:visible-1" "branch:visible-2" .. ))

The search service returns the search results, leaving it to gitiles to render them. Gitiles can apply any further filtering as necessary.

SERVICE MANAGEMENT

The above details how indexing and searching works. A fully fledged system also crawls repositories and (re)indexes them. Since the system is designed to run on a single machine, we provide a service management tool, with the following responsibilities:

Poll git hosting sites (eg. github.com, googlesource.com), to fetch new updates
Reindex any changed repositories
Run the webserver; and restart if it goes down for any reason
Delete old webserver logs

Security

This section assumes that ‘zoekt’ is used as a public facing webserver, indexing publicly available data, serving on HTTPS without authentication.

Since the UI is unauthenticated, there are no authentication secrets to steal.

Since the code is public, there is no sensitive code to steal.

This leaves us with the following senstive data:

Credentials for accesssing git repostitories (eg. github access token)
TLS server certificates
Query logs

The system handles the following untrusted data:

code in git repositories
search queries

Since ‘zoekt’ itself is written in Go, it does not have memory security problems: at worst, a bug in the query parser would lead to a crash.

The code to index is handled by ctags for symbol detection. The security risk this poses is mitigated by using a seccomp based sandboxing.

Privacy

Webserver logs can contain privacy sensitive data (such as IP addresses and search queries). For this reason, the service management tool deletes them after a configurable period of time.