Spark ETL to extract analytics data from Gerrit projects.

The job can be launched with the following parameters:
```
bin/spark-submit \
    --conf spark.es.nodes=es.mycompany.com \
    $JARS/SparkAnalytics-assembly.jar \
    --since 2000-06-01 \
    --aggregate email_hour \
    --url http://gerrit.mycompany.com \
    -e gerrit/analytics
```
Should ElasticSearch need authentication (i.e. if X-Pack is enabled), credentials can be passed through the `spark.es.net.http.auth.user` and `spark.es.net.http.auth.pass` parameters.
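For example, an invocation with authentication enabled might look like this (the host names and the `admin`/`secret` credentials are placeholders):

```
bin/spark-submit \
    --conf spark.es.nodes=es.mycompany.com \
    --conf spark.es.net.http.auth.user=admin \
    --conf spark.es.net.http.auth.pass=secret \
    $JARS/SparkAnalytics-assembly.jar \
    --since 2000-06-01 \
    --aggregate email_hour \
    --url http://gerrit.mycompany.com \
    -e gerrit/analytics
```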
`since`, `until`, and `aggregate` are the same as defined in the Gerrit Analytics plugin, see: https://gerrit.googlesource.com/plugins/analytics/+/master/README.md
-u --url Gerrit server URL with the analytics plugins installed
-p --prefix (optional) Projects prefix. Limit the results to those projects that start with the specified prefix.
-e --elasticIndex specify as `<index>/<type>` (e.g. `gerrit/analytics`) to be loaded into ElasticSearch. If not provided, no ES export will be performed.
-o --out folder location for storing the output as JSON files. If not provided, data is saved to an `analytics-` prefixed folder under `<tmp>`, where `<tmp>` is the system temporary directory.
-a --email-aliases (optional) “emails to author alias” input data path.
CSVs with 3 columns are expected as input.
Here is an example of the required file structure:
```
author,email,organization
John Smith,email@example.com,John's Company
John Smith,firstname.lastname@example.org,John's Company
David Smith,email@example.com,Independent
David Smith,firstname.lastname@example.org,Independent
```
You can use the following command to quickly extract the list of authors and emails to create part of an input CSV file:
```
echo -e "author,email\n$(git log --pretty="%an,%ae%n%cn,%ce" | sort | uniq)" > /tmp/my_aliases.csv
```
Once you have it, you just have to add the organization column.
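As a sketch, a fixed organization value can be appended with `awk` (the `Unknown` placeholder value and the file paths are arbitrary; edit the resulting rows by hand afterwards):

```shell
# Append an "organization" column to the extracted aliases CSV.
# "Unknown" is a placeholder value for the new column.
printf 'author,email\nJohn Smith,john@example.com\n' > /tmp/my_aliases.csv
awk 'NR==1 {print $0 ",organization"; next} {print $0 ",Unknown"}' \
    /tmp/my_aliases.csv > /tmp/my_aliases_with_org.csv
cat /tmp/my_aliases_with_org.csv
```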
A docker compose file is provided to spin up an instance of Elasticsearch with Kibana locally. Just run:

```
docker-compose up
```

Kibana will run on port 5601 and Elasticsearch on port 9200.
If Elasticsearch dies with exit code 137, you might have to give Docker more memory.
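Once the containers are up, a quick way to check that both services are answering on the ports above might be (endpoints assumed from the Elasticsearch and Kibana defaults):

```
curl -s http://localhost:9200   # Elasticsearch cluster info (JSON)
curl -s http://localhost:5601   # Kibana UI
```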
To build the `gerritforge/spark-gerrit-analytics-etl` docker container, just run `sbt docker`. If you want to distribute it, use `sbt dockerBuildAndPush`.

The build and distribution override the `latest` image tag too.
Remember to create an annotated tag for a release: the tag is used (via `git describe`) to define the docker image tag too.
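As a sketch of how an annotated tag drives the version number (the `v1.0.0` tag name and the throwaway repository are hypothetical):

```shell
# Demonstrate how "git describe" picks up an annotated tag (run in a throwaway repo).
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial"
git tag -a v1.0.0 -m "release v1.0.0"   # hypothetical release tag
git describe                            # prints: v1.0.0
```

On commits after the tag, `git describe` emits something like `v1.0.0-3-g1a2b3c4`, so every build gets a distinct, traceable image tag.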