Dockerize analytics ETL spark job
Update sbt to generate a Docker image that runs the analytics
ETL job in a container.
The image's entrypoint is a shell script that passes the relevant
configuration and environment variables to the spark-submit
command.
Feature: Issue 9856
Change-Id: I8e1ad454bac79860a518d0f45e6c8016d0507350
diff --git a/README.md b/README.md
index b74e19f..2cdf5d2 100644
--- a/README.md
+++ b/README.md
@@ -7,20 +7,30 @@
Job can be launched with the following parameters:
-```
+```bash
bin/spark-submit \
--conf spark.es.nodes=es.mycompany.com \
- $JARS/SparkAnalytics-assembly.jar \
+ $JARS/analytics-etl.jar \
--since 2000-06-01 \
--aggregate email_hour \
--url http://gerrit.mycompany.com \
- --events file:///tmp/gerrit-events-export.json
- --writeNotProcessedEventsTo file:///tmp/failed-events
- -e gerrit/analytics
- --username gerrit-api-username
+ --events file:///tmp/gerrit-events-export.json \
+ --writeNotProcessedEventsTo file:///tmp/failed-events \
+ -e gerrit/analytics \
+ --username gerrit-api-username \
--password gerrit-api-password
```
+You can also run this job in Docker:
+
+```bash
+docker run -ti --rm \
+ -e ES_HOST="es.mycompany.com" \
+ -e GERRIT_URL="http://gerrit.mycompany.com" \
+ -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour --writeNotProcessedEventsTo file:///tmp/failed-events -e gerrit/analytics" \
+ gerritforge/spark-gerrit-analytics-etl:latest
+```
+
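Inside the container, the entrypoint script expands `ANALYTICS_ARGS` unquoted when invoking `spark-submit`, so the value is split on whitespace into individual arguments. A minimal standalone sketch of that word-splitting behavior (a demo, not the actual entrypoint):

```shell
#!/bin/sh
# Demonstrates how an unquoted variable expansion is word-split into
# separate arguments -- the mechanism by which ANALYTICS_ARGS reaches
# spark-submit in the container entrypoint.
ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour"
set -- $ANALYTICS_ARGS   # unquoted expansion: split on whitespace
echo "argument count: $#"
echo "first argument: $1"
```

Running it prints `argument count: 4` and `first argument: --since`; quoting the expansion instead would pass the whole string as one argument.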
Should Elasticsearch need authentication (e.g., if X-Pack is enabled), credentials can be
passed through the *spark.es.net.http.auth.pass* and *spark.es.net.http.auth.user* parameters.
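For example, the credentials could be added as extra `--conf` options (a sketch with placeholder values; `echo` makes this a dry run that only prints the command, so drop the `echo` to actually submit the job):

```shell
#!/bin/sh
# Dry run: print a spark-submit invocation with X-Pack credentials added.
# ES_USER and ES_PASS are placeholder values, not real credentials.
ES_USER="spark-etl"
ES_PASS="changeme"
echo bin/spark-submit \
  --conf spark.es.nodes=es.mycompany.com \
  --conf "spark.es.net.http.auth.user=$ES_USER" \
  --conf "spark.es.net.http.auth.pass=$ES_PASS" \
  '$JARS/analytics-etl.jar'
```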
### Parameters
@@ -67,11 +77,44 @@
A docker compose file is provided to spin up an instance of Elasticsearch with Kibana locally.
Just run `docker-compose up`.
-Kibana will run on port `5601` and Elastisearch on port `9200`
-
### Caveats
-If Elastisearch dies with `exit code 137` you might have to give Docker more memory ([check this article for more details](https://github.com/moby/moby/issues/22211))
+* If Elasticsearch dies with `exit code 137`, you might have to give Docker more memory ([check this article for more details](https://github.com/moby/moby/issues/22211))
+
+* If you want to run the ETL job from within Docker, you need to make Elasticsearch and Gerrit reachable from the container.
+ You can do this by:
+
+  * spinning up the container within the same network used by your Elasticsearch container (`analytics-etl_ek` if you used the docker-compose provided by this repo)
+  * providing routing to the Docker host machine (via `--add-host="gerrit:<your_host_ip_address>"`)
+
+ For example:
+
+ ```bash
+  HOST_IP=$(ifconfig en0 | grep "inet " | awk '{print $2}')
+ docker run -ti --rm \
+ --add-host="gerrit:$HOST_IP" \
+ --network analytics-etl_ek \
+ -e ES_HOST="elasticsearch" \
+ -e GERRIT_URL="http://$HOST_IP:8080" \
+ -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour --writeNotProcessedEventsTo file:///tmp/failed-events -e gerrit/analytics" \
+ gerritforge/spark-gerrit-analytics-etl:latest
+ ```
+
+* If the dockerized spark job cannot connect to Elasticsearch (also running on Docker), you might need to tell Elasticsearch to publish
+its host to the cluster using the \_site\_ address.
+
+```yaml
+elasticsearch:
+ ...
+ environment:
+ ...
+ - http.host=0.0.0.0
+ - network.host=_site_
+ - http.publish_host=_site_
+ ...
+```
+
+See [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#network-interface-values) for more info.
## Distribute as Docker Container
@@ -80,4 +123,4 @@
The build and distribution override the `latest` image tag too
-Remember to create an annotated tag for a relase. The tag is used to define the docker image tag too
\ No newline at end of file
+Remember to create an annotated tag for a release. The tag is used to define the Docker image tag too.
\ No newline at end of file
diff --git a/build.sbt b/build.sbt
index a7302a6..75aea54 100644
--- a/build.sbt
+++ b/build.sbt
@@ -1,4 +1,5 @@
import sbt.Keys.version
+import sbtassembly.AssemblyKeys
enablePlugins(GitVersioning)
enablePlugins(DockerPlugin)
@@ -11,18 +12,21 @@
scalaVersion := "2.11.8"
-val sparkVersion = "2.1.1"
+val sparkVersion = "2.3.2"
val gerritApiVersion = "2.13.7"
val pluginName = "analytics-etl"
+val mainClassPackage = "com.gerritforge.analytics.job.Main"
+val dockerRepository = "spark-gerrit-analytics-etl"
+
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided"
- exclude("org.spark-project.spark", "unused"),
+ exclude("org.spark-project.spark", "unused"),
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.elasticsearch" %% "elasticsearch-spark-20" % "6.2.0"
- excludeAll ExclusionRule(organization = "org.apache.spark"),
+ excludeAll ExclusionRule(organization = "org.apache.spark"),
// json4s still needed by GerritProjects
"org.json4s" %% "json4s-native" % "3.2.11",
"com.google.gerrit" % "gerrit-plugin-api" % gerritApiVersion % Provided withSources(),
@@ -34,34 +38,49 @@
"org.scalatest" %% "scalatest" % "3.0.1" % "test"
)
-mainClass in (Compile,run) := Some("com.gerritforge.analytics.job.Main")
+mainClass in (Compile,run) := Some(mainClassPackage)
assemblyJarName in assembly := s"${name.value}.jar"
parallelExecution in Test := false
+// Docker settings
+docker := (docker dependsOn AssemblyKeys.assembly).value
+
dockerfile in docker := {
val artifact: File = assembly.value
val artifactTargetPath = s"/app/${name.value}-assembly.jar"
+ val entryPointPath = s"/app/gerrit-analytics-etl.sh"
new Dockerfile {
- from("gerritforge/jw2017-spark")
- runRaw("apk add --no-cache wget")
- runRaw("mkdir -p /app")
- add(artifact, artifactTargetPath )
+ from("openjdk:8-alpine")
+ label("maintainer" -> "GerritForge <info@gerritforge.com>")
+ runRaw("apk --update add curl tar bash && rm -rf /var/lib/apt/lists/* && rm /var/cache/apk/*")
+ env("SPARK_VERSION", sparkVersion)
+ env("SPARK_HOME", "/usr/local/spark-$SPARK_VERSION-bin-hadoop2.7")
+ env("PATH","$PATH:$SPARK_HOME/bin")
+ env("SPARK_JAR_PATH", artifactTargetPath)
+ env("SPARK_JAR_CLASS",mainClassPackage)
+ runRaw("curl -sL \"http://www-eu.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz\" | tar -xz -C /usr/local")
+ copy(baseDirectory(_ / "scripts" / "gerrit-analytics-etl.sh").value, file(entryPointPath))
+ add(artifact, artifactTargetPath)
+ runRaw(s"chmod +x $artifactTargetPath")
+ cmd(s"/bin/sh", entryPointPath)
}
}
-
imageNames in docker := Seq(
- ImageName(s"${organization.value}/spark-gerrit-analytics-etl:latest"),
+ ImageName(
+ namespace = Some(organization.value),
+ repository = dockerRepository,
+ tag = Some("latest")
+ ),
ImageName(
namespace = Some(organization.value),
- repository = "spark-gerrit-analytics-etl",
+ repository = dockerRepository,
tag = Some(version.value)
)
)
-
buildOptions in docker := BuildOptions(cache = false)
packageOptions in(Compile, packageBin) += Package.ManifestAttributes(
diff --git a/docker-compose.yaml b/docker-compose.yaml
index f9a5483..5b1dbee 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -27,7 +27,10 @@
environment:
- ES_JAVA_OPTS=-Xmx256m -Xms256m
- http.host=0.0.0.0
- - http.publish_host=127.0.0.1
+# Uncomment to make the containers reachable from the dockerized spark job.
+# See README.md for more info:
+# - network.host=_site_
+# - http.publish_host=_site_
networks:
- ek
diff --git a/scripts/gerrit-analytics-etl.sh b/scripts/gerrit-analytics-etl.sh
new file mode 100755
index 0000000..b4c1aca
--- /dev/null
+++ b/scripts/gerrit-analytics-etl.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+set -o errexit
+
+test -z "$ES_HOST" && { echo "ES_HOST is not set; exiting" >&2; exit 1; }
+test -z "$ANALYTICS_ARGS" && { echo "ANALYTICS_ARGS is not set; exiting" >&2; exit 1; }
+test -z "$GERRIT_URL" && { echo "GERRIT_URL is not set; exiting" >&2; exit 1; }
+
+echo "Elastic Search Host: $ES_HOST"
+echo "Gerrit URL: $GERRIT_URL"
+echo "Analytics arguments: $ANALYTICS_ARGS"
+echo "Spark jar class: $SPARK_JAR_CLASS"
+echo "Spark jar path: $SPARK_JAR_PATH"
+
+spark-submit \
+ --conf spark.es.nodes="$ES_HOST" \
+  --class "$SPARK_JAR_CLASS" "$SPARK_JAR_PATH" \
+  --url "$GERRIT_URL" \
+  $ANALYTICS_ARGS
\ No newline at end of file