Spark Observability on EKS

Overview

This example demonstrates production-grade observability for Apache Spark on EKS using native Spark metrics exporters, Prometheus for time-series data collection, Grafana for visualization, and Spark History Server for post-execution analysis.

Key Components:

  • Spark UI: Real-time application monitoring (active jobs only)
  • Spark History Server: Historical job analysis and event log replay
  • PrometheusServlet: Native Spark sink that exposes metrics over HTTP in Prometheus format
  • Kube-Prometheus-Stack: Metrics aggregation, storage, and alerting
  • Grafana: Customizable dashboards for operational insights

Prerequisites

Info: Before proceeding, ensure you have deployed the Spark on EKS infrastructure following the Infrastructure Setup Guide.

Required Components (deployed by default):

  • ✅ Spark on EKS infrastructure with enable_spark_history_server = true
  • ✅ Kube-Prometheus-Stack (Prometheus + Grafana + AlertManager)
  • ✅ S3 bucket for Spark event logs

You will also need kubectl configured for cluster access.

Spark Metrics Architecture

PrometheusServlet Sink

Apache Spark natively supports exporting metrics via the PrometheusServlet sink, an HTTP endpoint that exposes metrics in Prometheus format. This eliminates the need for external agents or sidecars.

How it works:

  1. Spark exposes /metrics/driver/prometheus/ and /metrics/executors/prometheus/ endpoints
  2. Prometheus scrapes these endpoints via Kubernetes pod annotations (see the example after this list)
  3. Metrics are stored in Prometheus time-series database
  4. Grafana queries Prometheus for visualization
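
In practice, step 2 relies on scrape annotations on the Spark pods. With the Spark Operator these can be set under spec.driver.annotations in the SparkApplication manifest — a minimal sketch, assuming the common prometheus.io annotation convention that Kube-Prometheus-Stack scrape configs typically honor:

spec:
  driver:
    annotations:
      prometheus.io/scrape: "true"                       # opt the pod into scraping
      prometheus.io/path: "/metrics/driver/prometheus/"  # must match the servlet path configured below
      prometheus.io/port: "4040"                         # Spark UI port serving the servlet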

Metrics Configuration

Add these configurations to your SparkApplication YAML to enable metrics export:

sparkConf:
  # Enable Prometheus metrics endpoint
  "spark.ui.prometheus.enabled": "true"
  "spark.executor.processTreeMetrics.enabled": "true"

  # PrometheusServlet sink configuration
  "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
  "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
  "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"

Key Configuration Parameters:

  • spark.ui.prometheus.enabled: Exposes executor metrics at /metrics/executors/prometheus/ on the Spark UI port
  • spark.executor.processTreeMetrics.enabled: Collects process-level CPU/memory metrics
  • prometheusServlet.class: Configures native Prometheus sink
  • prometheusServlet.path: Defines HTTP endpoint path for metrics
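
For orientation, this is where the sparkConf block above sits inside a SparkApplication manifest — a minimal sketch; the name, image, and resource values are illustrative placeholders, not taken from the examples directory:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: pyspark-pi                  # hypothetical application name
  namespace: spark-team-a
spec:
  type: Python
  mode: cluster
  image: apache/spark-py:v3.5.0     # placeholder image
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.5.0"
  sparkConf:
    # Metrics settings from the block above
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
  driver:
    cores: 1
    serviceAccount: spark-team-a    # service account with S3 access
  executor:
    instances: 2
    cores: 1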

Event Log Configuration

Spark History Server requires event logs stored in S3:

sparkConf:
  # Event logging for History Server
  "spark.eventLog.enabled": "true"
  "spark.eventLog.dir": "s3a://<S3_BUCKET>/spark-event-logs"
  "spark.eventLog.rolling.enabled": "true"
  "spark.eventLog.rolling.maxFileSize": "64m"

  # S3 access with Pod Identity (AWS SDK V2)
  "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
  "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
  "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"

Important: Use ContainerCredentialsProvider (AWS SDK V2) instead of the deprecated InstanceProfileCredentialsProvider for IRSA/Pod Identity authentication.

Hands-On: Deploy and Monitor Spark Jobs

Step 1: Submit Sample Spark Job

Use the reference PySpark job with observability pre-configured:

cd data-stacks/spark-on-eks/examples

# Replace <S3_BUCKET> with your bucket from Terraform outputs
export S3_BUCKET=$(kubectl get cm -n spark-team-a spark-s3-config -o jsonpath='{.data.bucket}' 2>/dev/null || echo "your-s3-bucket-name")

# Update S3 bucket placeholder
sed "s/\$S3_BUCKET/${S3_BUCKET}/g" pyspark-pi-job.yaml | kubectl apply -f -

Available Example Jobs:

  • pyspark-pi-job.yaml - Simple Pi calculation with metrics
  • spark-app-graviton.yaml - ARM-based Graviton workload
  • Additional examples in data-stacks/spark-on-eks/examples/

Step 2: Access Real-Time Spark Web UI

Monitor active jobs via Spark's native Web UI:

# Find running driver pod
kubectl get pods -n spark-team-a -l spark-role=driver

# Port-forward to driver pod
kubectl port-forward -n spark-team-a <driver-pod-name> 4040:4040

Open http://localhost:4040 to view:

  • Jobs Tab: DAG visualization and stage execution
  • Stages Tab: Task-level metrics (shuffle, I/O, GC)
  • Storage Tab: RDD/DataFrame caching statistics
  • Environment Tab: Spark configuration and JVM properties
  • Executors Tab: Per-executor resource utilization

Limitation: Spark UI terminates with the SparkContext. For post-execution analysis, use Spark History Server.

Spark History Server

Spark History Server replays event logs from S3 to provide persistent UI access after jobs complete.

Access History Server

# Port-forward to History Server service
kubectl port-forward -n spark-history-server svc/spark-history-server 18080:80

Open http://localhost:18080 to view:

  • All completed Spark applications
  • Full job execution timeline and DAG
  • Event log replay with same UI as live Spark UI
  • Historical resource utilization trends

Event Log Retention: Configure spark.history.fs.cleaner.enabled and spark.history.fs.cleaner.maxAge to manage S3 storage costs.
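
A minimal sketch of those cleaner settings as spark-defaults entries for the History Server (the retention values here are illustrative, not recommendations):

# Periodically delete old event logs from S3
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.interval  1d
spark.history.fs.cleaner.maxAge    7d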

Prometheus Metrics Collection

Verify Prometheus Scraping

Check if Prometheus is successfully scraping Spark metrics:

# Port-forward to Prometheus
kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090

Navigate to http://localhost:9090 and query:

# Verify Spark driver metrics
spark_driver_jvm_heap_used{namespace="spark-team-a"}

# Executor task metrics
spark_executor_completedTasks_total

# GC time metrics
rate(spark_driver_jvm_gc_time[5m])

Key Spark Metrics Exported

Driver Metrics:

  • spark_driver_jvm_heap_used - JVM heap utilization
  • spark_driver_DAGScheduler_stage_failedStages - Failed stages count
  • spark_driver_BlockManager_memory_memUsed_MB - Block manager memory

Executor Metrics:

  • spark_executor_completedTasks_total - Completed tasks per executor
  • spark_executor_runtime_jvm_heap_used - Executor JVM heap
  • spark_executor_filesystem_write_bytes - Shuffle/output write volume

Application Metrics:

  • spark_app_stages - Total stages
  • spark_app_tasks - Total tasks
  • spark_app_executors - Active executors count
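
Exact metric names depend on the Spark version and metrics namespace, so treat these PromQL sketches, built from the names listed above, as starting points to validate against what your Prometheus actually ingests:

# Task completion rate per executor over the last 5 minutes
rate(spark_executor_completedTasks_total[5m])

# Newly failed stages in the last 10 minutes
increase(spark_driver_DAGScheduler_stage_failedStages[10m])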

Grafana Dashboards

Access Grafana

# Get Grafana admin password
kubectl get secret -n grafana grafana -o jsonpath="{.data.admin-password}" | base64 -d && echo

# Port-forward to Grafana
kubectl port-forward -n grafana svc/grafana 3000:80

Login: admin / <password-from-secret>

Import Spark Dashboard

Import community dashboard ID: 7890 (Spark Application Monitoring):

  1. Navigate to Dashboards → Import
  2. Enter dashboard ID: 7890
  3. Select Prometheus data source: Prometheus
  4. Click Import
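
If you prefer dashboards as code, the Grafana sidecar shipped with Kube-Prometheus-Stack can load dashboards from ConfigMaps carrying the grafana_dashboard label — a sketch, assuming the sidecar is enabled with its default label and that spark-dashboard.json holds a dashboard JSON export you supply:

apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-dashboard
  namespace: grafana
  labels:
    grafana_dashboard: "1"   # default label the sidecar watches
data:
  # Replace the stub below with your exported dashboard JSON
  spark-dashboard.json: |
    {"title": "Spark Application Monitoring", "panels": []}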

Dashboard Panels:

  • Active/completed jobs timeline
  • Executor resource utilization (CPU, memory, disk)
  • Shuffle read/write performance
  • GC overhead and pause times
  • Task execution distribution

Custom Metrics Queries

Create custom panels with PromQL:

# Average task execution time per stage
avg(spark_stage_executor_runTime) by (stage_id)

# Shuffle spill to disk (performance bottleneck indicator)
sum(rate(spark_executor_shuffle_write_diskSpill_bytes[5m])) by (pod)

# Driver OOM risk indicator
(spark_driver_jvm_heap_used / spark_driver_jvm_heap_max) * 100 > 85

Production Best Practices

Event Log Optimization

# Rolling event logs to prevent large files
"spark.eventLog.rolling.enabled": "true"
"spark.eventLog.rolling.maxFileSize": "128m" # Tune based on job size

# Compress event logs for S3 storage savings
"spark.eventLog.compress": "true"

Metrics Cardinality Management

# Limit executor metrics to prevent Prometheus overload
"spark.metrics.executorMetricsSource.enabled": "false" # Disable if not needed

# Disable static metric sources (reduces collection overhead)
"spark.metrics.staticSources.enabled": "false"

Alerting Rules

Configure Prometheus AlertManager for critical events:

# Example: Spark job failure alert
- alert: SparkJobFailed
  expr: spark_driver_DAGScheduler_job_allJobs{status="failed"} > 0
  for: 1m
  annotations:
    summary: "Spark job {{ $labels.app_name }} failed"
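
With Kube-Prometheus-Stack, rules like this are typically delivered as a PrometheusRule custom resource — a sketch, assuming the default rule selector that matches on the Helm release label:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-alerts
  namespace: prometheus
  labels:
    release: kube-prometheus-stack   # must match your Prometheus ruleSelector
spec:
  groups:
    - name: spark
      rules:
        - alert: SparkJobFailed
          expr: spark_driver_DAGScheduler_job_allJobs{status="failed"} > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Spark job {{ $labels.app_name }} failed"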

Troubleshooting

Metrics Not Appearing in Prometheus

Check pod annotations:

kubectl get pods -n spark-team-a <driver-pod> -o yaml | grep -A5 annotations

Expected annotations:

prometheus.io/scrape: "true"
prometheus.io/path: "/metrics/driver/prometheus/"
prometheus.io/port: "4040"

Verify Prometheus targets:

  • Navigate to http://localhost:9090/targets
  • Search for spark-team-a namespace
  • Ensure targets are UP
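
The same check works from the command line via the Prometheus HTTP API — assuming the port-forward from earlier is still running and jq is installed:

# List active targets in spark-team-a and their scrape health
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.namespace == "spark-team-a") | {pod: .labels.pod, health: .health}'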

History Server Shows No Applications

Check event log permissions:

# Verify IRSA role has S3 read access
kubectl describe sa -n spark-team-a spark-team-a

# Check S3 event log path
aws s3 ls s3://<S3_BUCKET>/spark-event-logs/

High Metrics Cardinality

Reduce executor-level metrics:

# Disable per-executor detailed metrics
"spark.metrics.conf.executor.sink.prometheusServlet.class": ""