Observability for Spark on EKS

Introduction

In this post, we will explore observability for Spark on EKS. We will use the Spark History Server to review Spark application logs and check Spark job progress via the Spark Web UI. Amazon Managed Service for Prometheus is used to collect and store the metrics generated by Spark applications, and Grafana is used to build dashboards for monitoring use cases.

Deploying the Solution

We will reuse the previous Spark Operator example. Please follow this link to provision the required resources.

Set up data and py script

Let's navigate to the example folder under spark-k8s-operator and run the shell script to upload the data and the PySpark script to the S3 bucket created by Terraform above.

cd data-on-eks/analytics/terraform/spark-k8s-operator/examples/cluster-autoscaler/nvme-ephemeral-storage

Run the taxi-trip-execute.sh script with the following input. You will use the S3_BUCKET variable created earlier. Additionally, you must replace YOUR_REGION_HERE with the region of your choice, for example us-west-2.

This script will download some example taxi trip data and duplicate it to increase the dataset size. This will take some time and requires a relatively fast internet connection.

${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} YOUR_REGION_HERE
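
For example, with illustrative values (the bucket name and repository path below are hypothetical; use the values from your own Terraform run):

# Illustrative values only; substitute your own bucket name and region
export DOEKS_HOME=~/data-on-eks
export S3_BUCKET=my-doeks-spark-bucket
${DOEKS_HOME}/analytics/scripts/taxi-trip-execute.sh ${S3_BUCKET} us-west-2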

Spark Web UI

When you submit a Spark application, a Spark context is created, which gives you the Spark Web UI to monitor the execution of the application. Monitoring includes the following:

  • Spark configurations used
  • Spark Jobs, stages, and tasks details
  • DAG execution
  • Driver and Executor resource utilization
  • Application logs, and more

When your application finishes processing, the Spark context is terminated, and so is your Web UI. As a result, you cannot view the monitoring UI for an application that has already finished.

To try the Spark Web UI, let's update <S3_BUCKET> with your bucket name and <JOB_NAME> with "nvme-taxi-trip" in nvme-ephemeral-storage.yaml.
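
One way to make these substitutions from the shell is with sed (a sketch; it assumes GNU sed and that the placeholders appear literally in the file):

sed -i "s|<S3_BUCKET>|${S3_BUCKET}|g; s|<JOB_NAME>|nvme-taxi-trip|g" nvme-ephemeral-storage.yaml

Then apply the manifest: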

  kubectl apply -f nvme-ephemeral-storage.yaml
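
You can confirm the application was created and watch its driver and executor pods come up (same namespace as the port-forward command below):

kubectl get sparkapplications -n spark-team-a
kubectl get pods -n spark-team-a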

Then run the port-forward command to expose the Spark Web UI served by the driver pod.

kubectl port-forward po/taxi-trip 4040:4040 -n spark-team-a

Then open a browser and go to localhost:4040. You can view your Spark application UI as shown below.

[Screenshot: Spark Web UI]

Spark History Server

As mentioned above, the Spark Web UI is terminated once the Spark job is done. This is where the Spark History Server comes into the picture: it keeps the history (event logs) of all completed applications and their runtime information, which allows you to review metrics and monitor an application at a later time.

In this example, we installed the Spark History Server to read logs from the S3 bucket. In your Spark application YAML file, make sure you have the following settings:

    sparkConf:
      "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.InstanceProfileCredentialsProvider"
      "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
      "spark.eventLog.enabled": "true"
      "spark.eventLog.dir": "s3a://<your bucket>/logs/"

Run the port-forward command to expose the spark-history-server service.

kubectl port-forward services/spark-history-server 18085:80 -n spark-history-server

Then open a browser and go to localhost:18085. You can view the Spark History Server UI as shown below.

[Screenshot: Spark History Server UI]

Prometheus

To extract metrics from the Spark driver and executors, you must add the following configuration to the Spark application YAML file. In this example, it has already been added to nvme-ephemeral-storage.yaml.

"spark.ui.prometheus.enabled": "true" "spark.executor.processTreeMetrics.enabled": "true" "spark.kubernetes.driver.annotation.prometheus.io/scrape": "true" "spark.kubernetes.driver.annotation.prometheus.io/path": "/metrics/executors/prometheus/" "spark.kubernetes.driver.annotation.prometheus.io/port": "4040" "spark.kubernetes.driver.service.annotation.prometheus.io/scrape": "true" "spark.kubernetes.driver.service.annotation.prometheus.io/path": "/metrics/driver/prometheus/" "spark.kubernetes.driver.service.annotation.prometheus.io/port": "4040" "spark.metrics.conf..sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet" "spark.metrics.conf..sink.prometheusServlet.path": "/metrics/driver/prometheus/" "spark.metrics.conf.master.sink.prometheusServlet.path": "/metrics/master/prometheus/" "spark.metrics.conf.applications.sink.prometheusServlet.path": "/metrics/applications/prometheus/"

Run the port-forward command to expose the prometheus-server service.

kubectl port-forward service/prometheus-server 8080:80 -n prometheus

Then open a browser and go to localhost:8080. You can view your Prometheus server as shown below.

[Screenshot: Prometheus UI]
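
As a quick check, you can also query the Prometheus HTTP API for executor-related metrics through the same port-forward (the regex is illustrative; exact metric names depend on your application):

curl -sG 'http://localhost:8080/api/v1/query' --data-urlencode 'query={__name__=~".*executor.*"}'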

Grafana

Grafana has been installed as part of this setup. Use the commands below to access it via port-forward.

First, retrieve the Grafana admin password, which is stored in AWS Secrets Manager.
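
A minimal sketch, assuming the secret ID is grafana (check your Terraform outputs for the actual secret name and region):

aws secretsmanager get-secret-value --secret-id grafana --query "SecretString" --output text

Then expose the Grafana service: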

kubectl port-forward service/grafana 8080:80 -n grafana

The login username is admin, and the password can be retrieved from Secrets Manager as shown above. You can import a Spark dashboard using ID 7890.

[Screenshot: Grafana dashboard]