# Intro
This repository provides a set of Spark ETL jobs that extract, transform and persist data from
Gerrit projects for the purpose of performing analytics tasks.
Each job focuses on a specific dataset and knows how to extract, filter, aggregate,
transform and persist it.
The persistent storage of choice is *Elasticsearch*, which plays very well with the *Kibana* dashboard for
visualizing the analytics.
All jobs are configured as separate sbt projects and share only a thin layer of core
dependencies, such as Spark, the Elasticsearch client, test utils, etc.
Each job can be built and published independently, either as a fat jar artifact or as a docker image.
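For example, each job has its own sbt tasks (shown here for the Git Commits job; these commands are documented in the per-job Build sections below):
```bash
# Build the fat jar for a single job
sbt analyticsETLGitCommits/assembly

# Build the docker image for that job, or build and push it
sbt analyticsETLGitCommits/docker
sbt analyticsETLGitCommits/dockerBuildAndPush
```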
# Spark ETL jobs
Below is the list of all the Spark jobs provided by this repo, along with their documentation.
## Git Commits
Extracts and aggregates git commit data from Gerrit projects.
Requires a [Gerrit 2.13.x](https://www.gerritcodereview.com/releases/README.md) or later
with the [analytics](https://gerrit.googlesource.com/plugins/analytics/)
plugin installed and [Apache Spark 2.11](https://spark.apache.org/downloads.html) or later.
The job can be launched with the following parameters:
```bash
bin/spark-submit \
--class com.gerritforge.analytics.gitcommits.job.Main \
--conf spark.es.nodes=es.mycompany.com \
$JARS/analytics-etl-gitcommits.jar \
--since 2000-06-01 \
--aggregate email_hour \
--url http://gerrit.mycompany.com \
-e gerrit \
--username gerrit-api-username \
--password gerrit-api-password
```
You can also run this job in docker:
```bash
docker run -ti --rm \
-e ES_HOST="es.mycompany.com" \
-e GERRIT_URL="http://gerrit.mycompany.com" \
-e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour -e gerrit" \
gerritforge/gerrit-analytics-etl-gitcommits:latest
```
### Parameters
- since, until, aggregate are the same as defined in the Gerrit Analytics plugin,
see: https://gerrit.googlesource.com/plugins/analytics/+/master/README.md
- -u --url Gerrit server URL with the analytics plugin installed
- -p --prefix (*optional*) Projects prefix. Limit the results to those projects that start with the specified prefix.
- -e --elasticIndex ElasticSearch index name. If not provided no ES export will be performed. _Note: ElasticSearch 6.x
requires the index in the format `name/type`, while from ElasticSearch 7.x onwards just `name` is needed_
- -r --extract-branches Extract and process branches information (Optional) - Default: false
- -o --out folder location for storing the output as JSON files;
if not provided, data is saved to </tmp>/analytics-<NNNN>, where </tmp> is
the system temporary directory
- -a --email-aliases (*optional*) "emails to author alias" input data path. CSVs with 3 columns are expected as input (see the usage sketch after the NOTE below).
- -k --ignore-ssl-cert allows to proceed even for server connections otherwise considered insecure.
Here is an example of the required email-aliases file structure:
```csv
author,email,organization
John Smith,john@email.com,John's Company
John Smith,john@anotheremail.com,John's Company
David Smith,david.smith@email.com,Independent
David Smith,david@myemail.com,Independent
```
You can use the following command to quickly extract the list of authors and emails to create part of an input CSV file:
```bash
echo -e "author,email\n$(git log --pretty="%an,%ae%n%cn,%ce"|sort |uniq )" > /tmp/my_aliases.csv
```
Once you have it, you just have to add the organization column.
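A minimal sketch of one way to do that, assuming every author belongs to the same (placeholder) organization; adjust the appended value per row if that is not the case:
```bash
# Append an "organization" column to the extracted author/email CSV
awk -F, 'NR==1 {print $0",organization"} NR>1 {print $0",My Company"}' \
  /tmp/my_aliases.csv > /tmp/my_aliases_with_org.csv
```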
*NOTE:*
* **organization** will be extracted from the committer email if not specified
* **author** will be defaulted to the committer name if not specified
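Putting the optional flags together, here is a sketch of a launch that skips the ES export and writes JSON output to a local folder, using the email-aliases file prepared above (hostnames and paths are placeholders):
```bash
bin/spark-submit \
  --class com.gerritforge.analytics.gitcommits.job.Main \
  $JARS/analytics-etl-gitcommits.jar \
  --url http://gerrit.mycompany.com \
  --since 2000-06-01 \
  --aggregate email_hour \
  --out /tmp/gitcommits-json \
  --email-aliases /tmp/my_aliases_with_org.csv
```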
### Build
#### JAR
To build the jar file, simply use
`sbt analyticsETLGitCommits/assembly`
#### Docker
To build the *gerritforge/gerrit-analytics-etl-gitcommits* docker container just run:
`sbt analyticsETLGitCommits/docker`.
If you want to distribute it, use:
`sbt analyticsETLGitCommits/dockerBuildAndPush`.
The build and distribution override the `latest` image tag too.
Remember to create an annotated tag for a release; the tag is used to define the docker image tag too.
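A minimal sketch of the release flow, assuming standard git annotated tags (the tag name is just an example):
```bash
# Create an annotated tag for the release; the docker image tag is derived from it
git tag -a v1.2.3 -m "analytics-etl v1.2.3"

# Build and push the image (this also overrides the `latest` tag)
sbt analyticsETLGitCommits/dockerBuildAndPush
```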
## Audit Logs
Extract, aggregate and persist auditLog entries produced by Gerrit via the [audit-sl4j](https://gerrit.googlesource.com/plugins/audit-sl4j/) plugin.
AuditLog entries are an immutable trace of what happened on Gerrit and this ETL can leverage that to answer questions such as:
- How is Git incoming traffic distributed?
- Git/SSH vs. Git/HTTP traffic
- Git receive-pack vs. upload-pack
- Top#10 users of receive-pack
and many other questions related to the usage of Gerrit.
The job can be launched, for example, with the following parameters:
```bash
spark-submit \
--class com.gerritforge.analytics.auditlog.job.Main \
--conf spark.es.nodes=es.mycompany.com \
--conf spark.es.port=9200 \
--conf spark.es.index.auto.create=true \
$JARS/analytics-etl-auditlog.jar \
--gerritUrl https://gerrit.mycompany.com \
--elasticSearchIndex gerrit \
--eventsPath /path/to/auditlogs \
--ignoreSSLCert false \
--since 2000-06-01 \
--until 2020-12-01
```
You can also run this job in docker:
```bash
docker run \
--volume <source>/audit_log:/app/events/audit_log -ti --rm \
-e ES_HOST="<elasticsearch_url>" \
-e GERRIT_URL="http://<gerrit_url>:<gerrit_port>" \
-e ANALYTICS_ARGS="--elasticSearchIndex gerrit --eventsPath /app/events/audit_log --ignoreSSLCert false --since 2000-06-01 --until 2020-12-01 -a hour" \
gerritforge/gerrit-analytics-etl-auditlog:latest
```
### Parameters
* -u, --gerritUrl - gerrit server URL (Required)
* --username - Gerrit API Username (Optional)
* --password - Gerrit API Password (Optional)
* -i, --elasticSearchIndex - elasticSearch index to persist data into (Required)
* -p, --eventsPath - path to a directory (or a file) containing auditLogs events. Supports also _.gz_ files. (Required)
* -a, --eventsTimeAggregation - Events of the same type, produced by the same user will be aggregated with this time granularity: 'second', 'minute', 'hour', 'week', 'month', 'quarter'. (Optional) - Default: 'hour'
* -k, --ignoreSSLCert - Ignore SSL certificate validation (Optional) - Default: false
* -s, --since - process only auditLogs that occurred after (and including) this date (Optional)
* -u, --until - process only auditLogs that occurred before (and including) this date (Optional)
* -a, --additionalUserInfoPath - path to a CSV file containing additional user information (Optional). Currently it is only possible to add the user `type` (i.e.: _bot_, _human_).
If the type is not specified the user will be considered _human_.
Here is an example of an additional user information CSV file (a combined launch sketch follows it):
```csv
id,type
123,"bot"
456,"bot"
789,"human"
```
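Putting the optional flags together, here is a sketch of a launch that aggregates events per minute and uses the additional user information file above (hostnames, paths and credentials are placeholders):
```bash
spark-submit \
  --class com.gerritforge.analytics.auditlog.job.Main \
  --conf spark.es.nodes=es.mycompany.com \
  $JARS/analytics-etl-auditlog.jar \
  --gerritUrl https://gerrit.mycompany.com \
  --username gerrit-api-username \
  --password gerrit-api-password \
  --elasticSearchIndex gerrit \
  --eventsPath /path/to/auditlogs \
  --eventsTimeAggregation minute \
  --additionalUserInfoPath /path/to/additional_user_info.csv
```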
### Build
#### JAR
To build the jar file, simply use
`sbt analyticsETLAuditLog/assembly`
#### Docker
To build the *gerritforge/gerrit-analytics-etl-auditlog* docker image just run:
`sbt analyticsETLAuditLog/docker`.
If you want to distribute it, use:
`sbt analyticsETLAuditLog/dockerBuildAndPush`.
The build and distribution override the `latest` image tag too.
# Development environment
A docker compose file is provided to spin up an instance of Elasticsearch with Kibana locally.
Just run `docker-compose up`.
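To check that the stack is up, something along these lines should work, assuming the compose file exposes the default Elasticsearch and Kibana ports:
```bash
docker-compose up -d

# Elasticsearch normally answers on 9200, Kibana on 5601
curl -s http://localhost:9200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5601
```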
## Caveats
* If you want to run the git ETL job from within docker against containerized elasticsearch and/or gerrit instances, you need
to make them reachable by the ETL container. You can do this by running the ETL within the same network used by your elasticsearch/gerrit container (use the `--network` argument)
* If elasticsearch or gerrit run on your host machine, then you need to make _that_ reachable by the ETL container.
You can do this by providing routing to the docker host machine (e.g. `--add-host="gerrit:<your_host_ip_address>"` `--add-host="elasticsearch:<your_host_ip_address>"`)
For example:
* Run gitcommits ETL:
```bash
HOST_IP=`ifconfig en0 | grep "inet " | awk '{print $2}'` \
docker run -ti --rm \
--add-host="gerrit:$HOST_IP" \
--network analytics-etl_ek \
-e ES_HOST="elasticsearch" \
-e GERRIT_URL="http://$HOST_IP:8080" \
-e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour -e gerrit" \
gerritforge/gerrit-analytics-etl-gitcommits:latest
```
* Run auditlog ETL:
```bash
HOST_IP=`ifconfig en0 | grep "inet " | awk '{print $2}'` \
docker run -ti --rm --volume <source>/audit_log:/app/events/audit_log \
--add-host="gerrit:$HOST_IP" \
--network analytics-wizard_ek \
-e ES_HOST="elasticsearch" \
-e GERRIT_URL="http://$HOST_IP:8181" \
-e ANALYTICS_ARGS="--elasticSearchIndex gerrit --eventsPath /app/events/audit_log --ignoreSSLCert true --since 2000-06-01 --until 2020-12-01 -a hour" \
gerritforge/gerrit-analytics-etl-auditlog:latest
```
* If Elasticsearch dies with `exit code 137` you might have to give Docker more memory ([check this article for more details](https://github.com/moby/moby/issues/22211))
* Should ElasticSearch need authentication (e.g. if X-Pack is enabled), credentials can be passed through the *spark.es.net.http.auth.pass* and *spark.es.net.http.auth.user* parameters (a sketch is shown at the end of this section).
* If the dockerized spark job cannot connect to elasticsearch (also running on docker) you might need to tell elasticsearch to publish
the host to the cluster using the `_site_` address.
```yaml
elasticsearch:
  ...
  environment:
    ...
    - http.host=0.0.0.0
    - network.host=_site_
    - http.publish_host=_site_
    ...
```
See [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#network-interface-values) for more info.
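For the X-Pack authentication caveat above, here is a sketch of how the credentials could be passed to the git commits job via spark-submit (the user and password values are placeholders):
```bash
bin/spark-submit \
  --class com.gerritforge.analytics.gitcommits.job.Main \
  --conf spark.es.nodes=es.mycompany.com \
  --conf spark.es.net.http.auth.user=elastic \
  --conf spark.es.net.http.auth.pass=secret \
  $JARS/analytics-etl-gitcommits.jar \
  --since 2000-06-01 \
  --aggregate email_hour \
  --url http://gerrit.mycompany.com \
  -e gerrit
```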
## Build all
To perform actions across all jobs simply run the relevant *sbt* task without specifying the job name. For example:
* Test all jobs: `sbt test`
* Build jar for all jobs: `sbt assembly`
* Build docker for all jobs: `sbt docker`