| Modified Regular Expressions |
| ============================ |
| |
| The copyright scanner has to meet several competing requirements: 1) it must |
| be fast to keep server load and latency reasonable, 2) it must find copyrights |
| and licenses appearing in and interrupted by any number of comment formats, |
| and 3) it must be configurable by mere mortals. |
| |
| Configured patterns are basically [regular |
| expressions](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) |
| with modifications to assist the 3 requirements above. |
| |
| ## Unlimited Wildcards |
| |
| If you run the ScanTool.java command-line tool against the files in your |
| repository, it will already find a lot of the copyright and license declarations |
| that interest. Often it will pick up a couple extra junk words before or after |
| the text you want. In these cases, the natural inclination to use wildcards: |
| e.g. something like `.*My License.*` in a first-party pattern. However, when |
| scanning a large file, those wildcards can pick up a lot more junk words and |
| slow the scan to a crawl. |
| |
| To allow the easy and obvious choice, the plugin allows you to use the .* and .+ |
| patterns, and it modifies your pattern before scanning. When they appear at the |
| start or the end of the pattern, the plugin removes them before scanning the |
| file and puts them back when classifying the matches. |
| |
| The scanner will still find all of the matches containing the pattern that it |
| found before plus it might find a few exact matches without increasing the work |
| to scan the file. The matches are limited in size. When it puts the wildcards |
| back to analyze the matches, the wildcards in your first-party pattern cause |
| it to match despite the junk words so it correctly identifies the first-party |
| license without costing much cpu time. |
| |
| When they appear in the middle of your pattern, the plugin replaces them with a |
| pattern that matches a limited number of words separated by a limited number of |
| whitespace characters. The limits are large enough generally to match anything |
| of interest, but small enough to keep the scan down to a reasonable time. |
| |
| If the limits cause the plugin to miss too many desired hits, it's always |
| possible to write a more complex pattern with different limits. |
| |
| ## Comment Characters and Whitespace |
| |
| Whitespace does not affect the meaning of a license. The same text may be |
| formatted all manner of different ways using spaces, tabs, newlines, etc. to |
| align margins, center text, word-wrap at different columns etc. without changing |
| the meaning. Many times the same text will be incorporated into comments of |
| different languages. Even in the same language, sometimes the author will use |
| multi-line comments /*...*/ and sometimes the author will use single-line |
| comments //. Some languages use a completely different comment character #. |
| |
| You could insert a complex expression anywhere you expect whitespace. |
| e.g. `[\\s/*#]+` But the configuration will already be unreadable, and if a file |
| has a lot of whitespace, that pattern could match large blocks while slowing |
| the scan. |
| |
| To keep configurations readable, the plugin substitutes any embedded spaces with |
| a regular expression to match a limited number of whitespace or comment |
| characters where the limit is long enough to match almost all of the potential |
| hits without slowing the scan too much. |
| |
| Mostly you don't need to worry about the details. Just use a space between words |
| even if they might appear on different lines etc. |
| |
| However, if you use a space inside a character class [ a-z0-9], the plugin will |
| reject the pattern. You need to break that out into a more complicated pattern: |
| `(?: |[a-z0-9])` |
| |
| ## Capture Groups |
| |
| The plugin uses capture groups to keep track of licenses versus owners. To group |
| parts of your patterns, use a non-capturing group `(?:pattern)` instead of |
| `(pattern)`. The plugin will reject your configuration if you try to use a |
| pattern that looks like it contains a capture group. |
| |
| This might also force you to use a different order in character classes: |
| e.g. `[)a-d(]` instead of `[(a-d)]` |