Dockerize analytics ETL spark job
Update sbt to generate a Docker image that runs the analytics
ETL job in a container.
The image's entrypoint is a shell script that passes the relevant
configuration and environment variables to the spark-submit
command.
Feature: Issue 9856
Change-Id: I8e1ad454bac79860a518d0f45e6c8016d0507350
diff --git a/README.md b/README.md
index b74e19f..2cdf5d2 100644
--- a/README.md
+++ b/README.md
@@ -7,20 +7,30 @@
Job can be launched with the following parameters:
-```
+```bash
bin/spark-submit \
--conf spark.es.nodes=es.mycompany.com \
- $JARS/SparkAnalytics-assembly.jar \
+ $JARS/analytics-etl.jar \
--since 2000-06-01 \
--aggregate email_hour \
--url http://gerrit.mycompany.com \
- --events file:///tmp/gerrit-events-export.json
- --writeNotProcessedEventsTo file:///tmp/failed-events
- -e gerrit/analytics
- --username gerrit-api-username
+ --events file:///tmp/gerrit-events-export.json \
+ --writeNotProcessedEventsTo file:///tmp/failed-events \
+ -e gerrit/analytics \
+ --username gerrit-api-username \
--password gerrit-api-password
```
+You can also run this job in Docker:
+
+```bash
+docker run -ti --rm \
+ -e ES_HOST="es.mycompany.com" \
+ -e GERRIT_URL="http://gerrit.mycompany.com" \
+ -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour --writeNotProcessedEventsTo file:///tmp/failed-events -e gerrit/analytics" \
+ gerritforge/spark-gerrit-analytics-etl:latest
+```
+
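Inside the container, the entrypoint script expands `ANALYTICS_ARGS` unquoted when invoking `spark-submit`, so the value is split on whitespace into individual arguments. A minimal standalone sketch of that word-splitting behavior (a demo, not the actual entrypoint):

```shell
#!/bin/sh
# Demonstrates how an unquoted variable expansion is word-split into
# separate arguments -- the mechanism by which ANALYTICS_ARGS reaches
# spark-submit in the container entrypoint.
ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour"
set -- $ANALYTICS_ARGS   # unquoted expansion: split on whitespace
echo "argument count: $#"
echo "first argument: $1"
```

Running it prints `argument count: 4` and `first argument: --since`; quoting the expansion instead would pass the whole string as one argument.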
Should Elasticsearch need authentication (e.g., if X-Pack is enabled), credentials can be
passed through the *spark.es.net.http.auth.pass* and *spark.es.net.http.auth.user* parameters.
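For example, the credentials could be added as extra `--conf` options (a sketch with placeholder values; `echo` makes this a dry run that only prints the command, so drop the `echo` to actually submit the job):

```shell
#!/bin/sh
# Dry run: print a spark-submit invocation with X-Pack credentials added.
# ES_USER and ES_PASS are placeholder values, not real credentials.
ES_USER="spark-etl"
ES_PASS="changeme"
echo bin/spark-submit \
  --conf spark.es.nodes=es.mycompany.com \
  --conf "spark.es.net.http.auth.user=$ES_USER" \
  --conf "spark.es.net.http.auth.pass=$ES_PASS" \
  '$JARS/analytics-etl.jar'
```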
### Parameters
@@ -67,11 +77,44 @@
A docker compose file is provided to spin up an instance of Elasticsearch with Kibana locally.
Just run `docker-compose up`.
-Kibana will run on port `5601` and Elastisearch on port `9200`
-
### Caveats
-If Elastisearch dies with `exit code 137` you might have to give Docker more memory ([check this article for more details](https://github.com/moby/moby/issues/22211))
+* If Elasticsearch dies with `exit code 137`, you might have to give Docker more memory ([check this article for more details](https://github.com/moby/moby/issues/22211))
+
+* If you want to run the ETL job from within Docker, you need to make Elasticsearch and Gerrit reachable from the container.
+ You can do this by:
+
+  * spinning up the container within the same network used by your Elasticsearch container (`analytics-etl_ek` if you used the docker-compose provided by this repo)
+  * providing routing to the Docker host machine (via `--add-host="gerrit:<your_host_ip_address>"`)
+
+ For example:
+
+ ```bash
+  HOST_IP=$(ifconfig en0 | grep "inet " | awk '{print $2}')
+ docker run -ti --rm \
+ --add-host="gerrit:$HOST_IP" \
+ --network analytics-etl_ek \
+ -e ES_HOST="elasticsearch" \
+ -e GERRIT_URL="http://$HOST_IP:8080" \
+ -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour --writeNotProcessedEventsTo file:///tmp/failed-events -e gerrit/analytics" \
+ gerritforge/spark-gerrit-analytics-etl:latest
+ ```
+
+* If the dockerized spark job cannot connect to Elasticsearch (also running on Docker), you might need to tell Elasticsearch to publish
+its host to the cluster using the \_site\_ address.
+
+```yaml
+elasticsearch:
+ ...
+ environment:
+ ...
+ - http.host=0.0.0.0
+ - network.host=_site_
+ - http.publish_host=_site_
+ ...
+```
+
+See [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#network-interface-values) for more info.
## Distribute as Docker Container
@@ -80,4 +123,4 @@
The build and distribution override the `latest` image tag too
-Remember to create an annotated tag for a relase. The tag is used to define the docker image tag too
\ No newline at end of file
+Remember to create an annotated tag for a release. The tag is used to define the Docker image tag too.
\ No newline at end of file
diff --git a/build.sbt b/build.sbt
index a7302a6..75aea54 100644
--- a/build.sbt
+++ b/build.sbt
@@ -1,4 +1,5 @@
import sbt.Keys.version
+import sbtassembly.AssemblyKeys
enablePlugins(GitVersioning)
enablePlugins(DockerPlugin)
@@ -11,18 +12,21 @@
scalaVersion := "2.11.8"
-val sparkVersion = "2.1.1"
+val sparkVersion = "2.3.2"
val gerritApiVersion = "2.13.7"
val pluginName = "analytics-etl"
+val mainClassPackage = "com.gerritforge.analytics.job.Main"
+val dockerRepository = "spark-gerrit-analytics-etl"
+
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided"
- exclude("org.spark-project.spark", "unused"),
+ exclude("org.spark-project.spark", "unused"),
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.elasticsearch" %% "elasticsearch-spark-20" % "6.2.0"
- excludeAll ExclusionRule(organization = "org.apache.spark"),
+ excludeAll ExclusionRule(organization = "org.apache.spark"),
// json4s still needed by GerritProjects
"org.json4s" %% "json4s-native" % "3.2.11",
"com.google.gerrit" % "gerrit-plugin-api" % gerritApiVersion % Provided withSources(),
@@ -34,34 +38,49 @@
"org.scalatest" %% "scalatest" % "3.0.1" % "test"
)
-mainClass in (Compile,run) := Some("com.gerritforge.analytics.job.Main")
+mainClass in (Compile,run) := Some(mainClassPackage)
assemblyJarName in assembly := s"${name.value}.jar"
parallelExecution in Test := false
+// Docker settings
+docker := (docker dependsOn AssemblyKeys.assembly).value
+
dockerfile in docker := {
val artifact: File = assembly.value
val artifactTargetPath = s"/app/${name.value}-assembly.jar"
+ val entryPointPath = s"/app/gerrit-analytics-etl.sh"
new Dockerfile {
- from("gerritforge/jw2017-spark")
- runRaw("apk add --no-cache wget")
- runRaw("mkdir -p /app")
- add(artifact, artifactTargetPath )
+ from("openjdk:8-alpine")
+ label("maintainer" -> "GerritForge <info@gerritforge.com>")
+ runRaw("apk --update add curl tar bash && rm -rf /var/lib/apt/lists/* && rm /var/cache/apk/*")
+ env("SPARK_VERSION", sparkVersion)
+ env("SPARK_HOME", "/usr/local/spark-$SPARK_VERSION-bin-hadoop2.7")
+ env("PATH","$PATH:$SPARK_HOME/bin")
+ env("SPARK_JAR_PATH", artifactTargetPath)
+ env("SPARK_JAR_CLASS",mainClassPackage)
+ runRaw("curl -sL \"http://www-eu.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz\" | tar -xz -C /usr/local")
+ copy(baseDirectory(_ / "scripts" / "gerrit-analytics-etl.sh").value, file(entryPointPath))
+ add(artifact, artifactTargetPath)
+ runRaw(s"chmod +x $artifactTargetPath")
+ cmd(s"/bin/sh", entryPointPath)
}
}
-
imageNames in docker := Seq(
- ImageName(s"${organization.value}/spark-gerrit-analytics-etl:latest"),
+ ImageName(
+ namespace = Some(organization.value),
+ repository = dockerRepository,
+ tag = Some("latest")
+ ),
ImageName(
namespace = Some(organization.value),
- repository = "spark-gerrit-analytics-etl",
+ repository = dockerRepository,
tag = Some(version.value)
)
)
-
buildOptions in docker := BuildOptions(cache = false)
packageOptions in(Compile, packageBin) += Package.ManifestAttributes(
diff --git a/docker-compose.yaml b/docker-compose.yaml
index f9a5483..5b1dbee 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -27,7 +27,10 @@
environment:
- ES_JAVA_OPTS=-Xmx256m -Xms256m
- http.host=0.0.0.0
- - http.publish_host=127.0.0.1
+# Uncomment to make the containers reachable from the dockerized spark job.
+# See README.md for more info:
+# - network.host=_site_
+# - http.publish_host=_site_
networks:
- ek
diff --git a/scripts/gerrit-analytics-etl.sh b/scripts/gerrit-analytics-etl.sh
new file mode 100755
index 0000000..b4c1aca
--- /dev/null
+++ b/scripts/gerrit-analytics-etl.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+set -o errexit
+
+test -z "$ES_HOST" && { echo "ES_HOST is not set; exiting" >&2; exit 1; }
+test -z "$ANALYTICS_ARGS" && { echo "ANALYTICS_ARGS is not set; exiting" >&2; exit 1; }
+test -z "$GERRIT_URL" && { echo "GERRIT_URL is not set; exiting" >&2; exit 1; }
+
+echo "Elastic Search Host: $ES_HOST"
+echo "Gerrit URL: $GERRIT_URL"
+echo "Analytics arguments: $ANALYTICS_ARGS"
+echo "Spark jar class: $SPARK_JAR_CLASS"
+echo "Spark jar path: $SPARK_JAR_PATH"
+
+spark-submit \
+ --conf spark.es.nodes="$ES_HOST" \
+  --class "$SPARK_JAR_CLASS" "$SPARK_JAR_PATH" \
+  --url "$GERRIT_URL" \
+  $ANALYTICS_ARGS
\ No newline at end of file