Software engineering is more about reading than writing code, and part of this process is finding the code that you should read. If you are working on a large project, then finding source code through navigation quickly becomes inefficient.
Search engines let you find interesting code much faster than browsing code, in much the same way that search engines speed up finding things on the internet.
I had to implement SSH hashed hostkey checking on a whim recently, and here is how I quickly zoomed into the relevant code using our public zoekt instance:
hash host ssh: more than 20k results in 750 files, in 3 seconds
hash host r:openssh: 6k results in 114 files, in 20ms
hash host r:openssh known_host: 4k results in 42 files, in 13ms
The last query still yielded a substantial number of results, but the function hash_host that I was looking for was the 3rd result from the first file.
Often, you don't know exactly what you are looking for until you find it. Code search is effective because you can formulate an approximate query and then refine it based on the results you get. For this to work, you need the following features:
Coverage: the code that interests you should be available for searching
Speed: search should return useful results quickly (sub-second), so you can iterate on queries
Approximate queries: matching should be done case-insensitively, on arbitrary substrings, so you don't have to know exactly what you are looking for in advance.
Filtering: you can winnow down results by composing more specific queries
Ranking: interesting results (e.g. function definitions, whole-word matches) should be at the top.
How does zoekt provide for these?
Coverage: zoekt comes with tools to mirror parts of common Git hosting sites. cs.bazel.build uses this to index most of the Google-authored open source software on github.com and googlesource.com.
Speed: zoekt uses an index based on positional trigrams. For rare strings, e.g. nienhuys, this typically yields results in ~10ms if the operating system caches are warm.
Approximate queries: zoekt supports substring patterns and regular expressions, and can do case-insensitive matching on UTF-8 text.
Filtering: you can narrow a query by adding extra atoms (e.g. f:\.go$ limits matches to Go source code) and filter out terms with -, so \blinus\b -torvalds finds the Linuses other than Linus Torvalds (the Go sketch after this list shows the same kind of composed query run programmatically).
Ranking: zoekt uses ctags to find declarations, and these are boosted in the search ranking.
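All of this is also available programmatically through zoekt's Go API. The following is a minimal sketch of running a composed query against a local index directory; it assumes the github.com/google/zoekt module with its query.Parse, shards.NewDirectorySearcher and Searcher.Search entry points, and an index built by zoekt-index under ~/.zoekt, so treat the exact package paths, field names and locations as assumptions that may differ between versions.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/zoekt" // assumed import paths; newer forks may differ
	"github.com/google/zoekt/query"
	"github.com/google/zoekt/shards"
)

func main() {
	// Open all index shards in a local index directory
	// (zoekt-index writes to ~/.zoekt by default; adjust for your setup).
	searcher, err := shards.NewDirectorySearcher("/home/me/.zoekt")
	if err != nil {
		log.Fatal(err)
	}
	defer searcher.Close()

	// Compose atoms exactly as in the web UI: substrings plus repo and file filters.
	q, err := query.Parse(`hash host r:openssh f:known_hosts`)
	if err != nil {
		log.Fatal(err)
	}

	res, err := searcher.Search(context.Background(), q, &zoekt.SearchOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Print each matching line with its file name and line number.
	for _, f := range res.Files {
		for _, m := range f.LineMatches {
			fmt.Printf("%s:%d: %s\n", f.FileName, m.LineNumber, m.Line)
		}
	}
}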
Why not grep -r?
Grep lets you find arbitrary substrings, but it doesn't scale to large corpuses, and it lacks filtering and ranking.
If your project fits into your IDE, then that is great. Unfortunately, loading projects into IDEs is slow, cumbersome, and not supported by all projects.
Why not github.com?
GitHub's search has great coverage, but unfortunately its search functionality doesn't support arbitrary substrings. For example, a query for part of my surname does not turn up anything (except this document), while my complete name does.
Etsy/hound is a code search engine which supports regular expressions over large corpuses, but it is about 10x slower than zoekt, has only rudimentary support for filtering, and does no symbol ranking.
livegrep is a code search engine which supports regular expressions over large corpuses. However, due to its indexing technique, it requires a lot of RAM and CPU. There is only rudimentary support for filtering, and there is no symbol ranking.
What does zoekt require?
The search server should have local SSD to store the index file (which is 3.5x the corpus size), and at least 20% more RAM than the corpus size.
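For example, with those ratios a 10GB corpus needs roughly 35GB of SSD for the index and at least 12GB of RAM.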
Does zoekt support multiple branches?
Yes: you can index up to 64 branches (see also https://github.com/google/zoekt/issues/32). Files that are identical across branches take up space just once in the index.
Rare strings are extremely fast to retrieve: for example, r:torvalds crazy (searching for "crazy" in the Linux kernel) typically takes about 7-10ms on cs.bazel.build.
The speed for common strings is dominated by how many results you want to see. For example, r:torvalds license can give some results quickly, but producing all 86k results takes between 100ms and 1 second. Streaming the results to your browser and rendering the HTML then takes several seconds.
The Linux kernel (55k files, 545MB of data) takes about 160 seconds to index on my X250 laptop using a single thread. The process can be parallelized for a speedup.
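At 545MB in about 160 seconds, that works out to roughly 3.4MB of source per second per core, so indexing time shrinks roughly linearly with the number of cores used.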
Currently, cs.bazel.build runs on a single Google Cloud VM with 16 vCPUs, 60GB of RAM, and an attached physical SSD.
How does zoekt work?
In short, it splits up each file into trigrams (groups of 3 Unicode characters) and stores the offset of each occurrence. Substrings are found by looking up different trigrams from the query at the correct distance apart.
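To make that concrete, here is a small self-contained Go sketch of the idea; it works on bytes rather than Unicode characters and leaves out zoekt's compression, case folding, regular expressions and ranking, so it illustrates the technique rather than zoekt's actual implementation: index every trigram with its byte offsets, then check candidate positions where the first and last trigram of the query occur the right distance apart.

package main

import "fmt"

// index maps each trigram to the byte offsets where it occurs.
func index(content string) map[string][]int {
	idx := map[string][]int{}
	for i := 0; i+3 <= len(content); i++ {
		t := content[i : i+3]
		idx[t] = append(idx[t], i)
	}
	return idx
}

// search finds all offsets of substr (at least 3 bytes long) by combining
// the positions of its first and last trigram at the right distance apart,
// then verifying the full match against the content.
func search(content string, idx map[string][]int, substr string) []int {
	first := substr[:3]
	last := substr[len(substr)-3:]
	dist := len(substr) - 3

	lastPos := map[int]bool{}
	for _, p := range idx[last] {
		lastPos[p] = true
	}

	var hits []int
	for _, p := range idx[first] {
		if lastPos[p+dist] && content[p:p+len(substr)] == substr {
			hits = append(hits, p)
		}
	}
	return hits
}

func main() {
	content := "known_hosts hashing: hash_host computes the hashed host name"
	idx := index(content)
	fmt.Println(search(content, idx, "hash_host")) // offsets where "hash_host" occurs
}

A real implementation would store the offset lists in compressed form on disk and pick rare trigrams from the query, which keeps both the index size and the number of candidate positions to verify small.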
Some further background documentation is available.