tag	7ee65c05ae3a5fb10e0f504bfb589ddca7705841
tagger	Luca Milanesio <luca.milanesio@gmail.com>	Thu Jan 25 09:13:45 2018 -0800
object	2f8cd15a6af041c0b4358b42dd8eb7e8dbee4783

commit	2f8cd15a6af041c0b4358b42dd8eb7e8dbee4783
author	Stefano Galarraga <galarragas@gmail.com>	Sat Jan 20 17:32:21 2018 +0000
committer	Luca Milanesio <luca.milanesio@gmail.com>	Thu Jan 25 13:01:43 2018 +0000
tree	4b33d149d1d139e3bc5565f83969c48461149280
parent	201d957806558d88f764561213fced49dfa1e7b6

README.md

Gerrit Analytics ETL

Spark ETL to extract analytics data from Gerrit projects.

Requires Gerrit 2.13.x or later with the analytics plugin installed and Apache Spark 2.11 or later.

The job can be launched with the following parameters:

bin/spark-submit \
    --conf spark.es.nodes=es.mycompany.com \
    $JARS/SparkAnalytics-assembly.jar \
    --since 2000-06-01 \
    --aggregate email_hour \
    --url http://gerrit.mycompany.com \
    -e gerrit/analytics

Should Elasticsearch need authentication (e.g. if X-Pack is enabled), credentials can be passed through the spark.es.net.http.auth.user and spark.es.net.http.auth.pass parameters.
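As a sketch of how the credentials might be passed, the launch command shown earlier can be extended with two more `--conf` options. The user name `elastic` and the `ES_PASSWORD` environment variable are assumptions made for illustration; every other argument is copied from the example above.

```shell
# ES_PASSWORD is a hypothetical environment variable holding the password;
# "elastic" is an assumed user name. Replace both with your own values.
bin/spark-submit \
    --conf spark.es.nodes=es.mycompany.com \
    --conf spark.es.net.http.auth.user=elastic \
    --conf spark.es.net.http.auth.pass="$ES_PASSWORD" \
    $JARS/SparkAnalytics-assembly.jar \
    --since 2000-06-01 \
    --aggregate email_hour \
    --url http://gerrit.mycompany.com \
    -e gerrit/analytics
```

Reading the password from the environment keeps it out of scripts kept under version control.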

Parameters

since, until and aggregate are the same as those defined in the Gerrit Analytics plugin; see https://gerrit.googlesource.com/plugins/analytics/+/master/README.md
-u --url Gerrit server URL with the analytics plugin installed
-p --prefix (optional) Projects prefix. Limit the results to those projects that start with the specified prefix.
-e --elasticIndex specify as index/type (for example gerrit/analytics) to load the data into Elasticsearch; if not provided, no Elasticsearch export is performed
-o --out folder location for storing the output as JSON files; if not provided, data is saved to a folder whose name starts with analytics- under the system temporary directory
-a --email-aliases (optional) “emails to author alias” input data path.
CSV files with 3 columns are expected as input.
Here is an example of the required file structure:
```
author,email,organization
John Smith,john@email.com,John's Company
John Smith,john@anotheremail.com,John's Company
David Smith,david.smith@email.com,Independent
David Smith,david@myemail.com,Independent
```
You can use the following command to quickly extract the list of authors and emails to create part of an input CSV file:
```
echo -e "author,email\n$(git log --pretty="%an,%ae%n%cn,%ce"|sort |uniq )" > /tmp/my_aliases.csv
```
Once you have it, you just have to add the organization column.
NOTE:
- organization will be extracted from the committer email if not specified
- author will be defaulted to the committer name if not specified
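If you would rather pre-fill the organization column than type it by hand, the following sketch may serve as a starting point. It assumes the organization can be approximated by the email domain, which is only a guess to be corrected manually, and that author names contain no commas. The function name `add_org_column` and the output path are hypothetical.

```shell
# Append an "organization" column to an "author,email" CSV file.
# Assumption: the organization is approximated by the email domain;
# review the result and fix entries where that guess is wrong.
add_org_column() {
    awk -F, 'BEGIN { OFS = "," }
        NR == 1 { print $0, "organization"; next }   # header row
        NF < 2  { next }                             # skip blank or malformed lines
        { n = split($2, parts, "@"); print $0, (n == 2 ? parts[2] : "") }' "$1"
}

# Usage (the output path is hypothetical):
#   add_org_column /tmp/my_aliases.csv > /tmp/my_aliases_with_org.csv
```

The resulting file can then be passed to the job through the -a parameter.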

Development environment

A Docker Compose file is provided to spin up an instance of Elasticsearch with Kibana locally. Just run docker-compose up.

Kibana will run on port 5601 and Elasticsearch on port 9200.
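Once the containers are up, a quick way to confirm that both services are reachable is to query the ports listed above. This is a minimal sketch that assumes curl is installed, the services are exposed on localhost, and no authentication is configured in the local setup.

```shell
# Elasticsearch answers on its root endpoint with a JSON document describing the node.
curl -fsS http://localhost:9200 || echo "Elasticsearch is not reachable yet"

# Kibana: request only the response headers to check that it is listening.
curl -fsSI http://localhost:5601 || echo "Kibana is not reachable yet (or returned an error)"
```

Both services can take a little while to start, so a failed check right after docker-compose up is not necessarily a problem.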

Caveats

If Elasticsearch dies with exit code 137, you might have to give Docker more memory (check this article for more details).

Distribute as Docker Container

To build the gerritforge/spark-gerrit-analytics-etl Docker container, just run sbt docker. If you want to distribute it, use sbt dockerBuildAndPush.

Both the build and the distribution also overwrite the latest image tag.

Remember to create an annotated tag for a release. The tag is also used to define the Docker image tag.
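A release could therefore look like the following sketch. The version v1.0.0, the tag message and the remote name origin are placeholders chosen for illustration; how the image tag is derived from the Git tag is determined by the project's sbt build.

```shell
# Create an annotated tag (-a) for the release; version and message are hypothetical.
git tag -a v1.0.0 -m "Release 1.0.0"

# Publish the tag; "origin" is assumed to be the name of the remote.
git push origin v1.0.0

# Build the Docker image and push it (command taken from the text above).
sbt dockerBuildAndPush
```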