
Spark with EBS Dynamic PVC Storage

Learn to use EBS Dynamic PVC for Spark shuffle storage with automatic provisioning and data recovery.

Architecture: PVC Reuse & Fault Tolerance

Key Benefits:

  • 🎯 Driver Ownership: Driver pod owns all PVCs for centralized management
  • ♻️ PVC Reuse: Failed executors reuse existing PVCs with preserved shuffle data
  • ⚡ Faster Recovery: No volume provisioning delay during executor restart
  • 💰 Cost Efficient: Reuses EBS volumes instead of creating new ones

PVC Reuse Flow

Prerequisites

  • Deploy Spark on EKS infrastructure: Infrastructure Setup
  • EBS CSI controller running, with a gp2 or gp3 storage class for dynamic volume creation

EBS CSI Requirement

This example requires the EBS CSI driver to dynamically create volumes for Spark jobs. Ensure your cluster has the EBS CSI controller deployed with appropriate storage classes.
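
Before submitting the job, you can sanity-check both prerequisites from the command line; the deployment name ebs-csi-controller and the kube-system namespace below assume the standard EKS add-on install, so adjust them if your cluster differs:

# Confirm the EBS CSI controller is running (deployment name assumes the standard EKS add-on install)
kubectl get deployment ebs-csi-controller -n kube-system

# Confirm a gp2 or gp3 storage class exists for dynamic provisioning
kubectl get storageclass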

What is Shuffle Storage in Spark?

Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
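
With the configuration on this page, that intermediate data lands under spark.local.dir, which is mounted at /data1 on the dynamically provisioned EBS volume. As a rough way to see it (a sketch assuming a job is currently running and that Spark's usual spark-role=executor pod label is present), you can list the blockmgr-* directories Spark creates for shuffle files:

# Pick any running executor pod and list Spark's shuffle directories under /data1
EXEC_POD=$(kubectl get pods -n spark-team-a -l spark-role=executor -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n spark-team-a "$EXEC_POD" -- sh -c 'ls -lh /data1/blockmgr-*'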

Spark Shuffle Storage Options

Storage Type               | Performance   | Cost       | Use Case
NVMe SSD Instances         | 🔥 Very High  | 💰 High    | Maximum performance workloads
EBS Dynamic PVC            | ⚡ High       | 💰 Medium  | Featured - Production fault tolerance
EBS Node Storage           | 📊 Medium     | 💵 Medium  | Shared volume per node
FSx for Lustre             | 📊 Medium     | 💵 Low     | Parallel filesystem for HPC
S3 Express + Mountpoint    | 📊 Medium     | 💵 Low     | Very large datasets
Remote Shuffle (Celeborn)  | ⚡ High       | 💰 Medium  | Resource disaggregation

Benefits: Performance & Cost

  • NVMe: Fastest local SSD storage, highest cost per GB
  • EBS Dynamic PVC: Balance of performance and cost with fault tolerance
  • EBS Node Storage: Cost-effective shared volumes
  • FSx/S3 Express: Cost-optimized for large-scale processing

Example Code

View the complete configuration:

📄 Complete EBS Dynamic PVC Configuration
examples/ebs-storage-dynamic-pvc.yaml
# Pre-requisites before running this job:
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with your S3 bucket created by this example (check Terraform outputs)
# 3/ Execute taxi-trip-execute.sh

# This example supports the following features:
#   - Shuffle data recovery on reused PVCs (SPARK-35593)
#   - Driver-owned on-demand PVCs (SPARK-35182)
# WARNING: The spark-operator cluster role is missing a 'persistentvolumeclaims' permission. Ensure you add this permission to the spark-operator cluster role.

---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-ebs-pvc"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-ebs-pvc"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver UI,
  # ensure the Spark Operator Helm chart is deployed with Ingress enabled, then uncomment:
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py"  # Path to a bundled JAR, Python, or R file of the application
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-ebs-pvc"
    "spark.kubernetes.driver.pod.name": "taxi-trip-ebs-pvc"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-ebs-pvc"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"  # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"

    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": "100"

    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"

    # EBS Dynamic PVC config
    # Mount a dynamically created persistent volume claim per pod by using "OnDemand" as the claim name
    # together with the storageClass and sizeLimit options. This is useful with Dynamic Allocation.
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"

    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"

    # Support shuffle data recovery on reused PVCs (SPARK-35593)
    # If true, the driver pod becomes the owner of on-demand persistent volume claims instead of the executor pods.
    "spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
    # If true, the driver tries to reuse driver-owned on-demand persistent volume claims of deleted executor pods, if any exist.
    # This reduces executor pod creation delay by skipping persistent volume creation.
    # Note that a pod in `Terminating` status is not yet a deleted pod, so its persistent volume claims are not reusable;
    # Spark creates new persistent volume claims when no reusable one exists. As a result, the total number of
    # persistent volume claims can sometimes be larger than the number of running executors.
    # This setting requires spark.kubernetes.driver.ownPersistentVolumeClaim=true.
    "spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20

  driver:
    initContainers:
      - name: volume-permissions
        image: public.ecr.aws/y4g4v0z7/busybox
        command: [ 'sh', '-c', 'chown -R 185 /data1' ]
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 1
    coreLimit: "1200m"
    memory: "4g"
    memoryOverhead: "4g"
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.sh/capacity-type: "on-demand"
    tolerations:
      - key: "spark-compute-optimized"
        operator: "Exists"
        effect: "NoSchedule"

  executor:
    initContainers:
      - name: volume-permissions
        image: public.ecr.aws/y4g4v0z7/busybox
        command: [ 'sh', '-c', 'chown -R 185 /data1' ]
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 1
    coreLimit: "1200m"
    instances: 2
    memory: "4g"
    memoryOverhead: "4g"
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
    tolerations:
      - key: "spark-compute-optimized"
        operator: "Exists"
        effect: "NoSchedule"
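
Regarding the permission warning in the file header: on-demand PVCs are created at runtime, so the service accounts involved need create/delete rights on persistentvolumeclaims. A quick, hedged check for the job namespace (the spark-team-a names come from this example; your operator's cluster role name depends on the Helm release):

# Verify the driver's service account is allowed to create PVCs in the job namespace
kubectl auth can-i create persistentvolumeclaims -n spark-team-a \
  --as=system:serviceaccount:spark-team-a:spark-team-a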

EBS Dynamic PVC Configuration

Key configuration for dynamic PVC provisioning:

Essential Dynamic PVC Settings
sparkConf:
  # Dynamic PVC creation - Driver
  "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
  "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
  "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
  "spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"

  # Dynamic PVC creation - Executor
  "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
  "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
  "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
  "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"

  # PVC ownership and reuse for fault tolerance
  "spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
  "spark.kubernetes.driver.reusePersistentVolumeClaim": "true"

Features:

  • OnDemand: Automatically creates PVCs per pod
  • gp3: EBS gp3 volume type (better price/performance than gp2)
  • 100Gi: Storage size per volume (sized for this example workload)
  • Driver ownership enables PVC reuse for fault tolerance (see the check below)
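
A quick way to confirm the ownership model while a job is running is to inspect the ownerReferences on the on-demand PVCs; with ownPersistentVolumeClaim=true, each PVC should report the driver pod as its owner (a sketch, assuming a job is currently active in spark-team-a):

# Each on-demand PVC should list the driver pod under ownerReferences
kubectl get pvc -n spark-team-a \
  -o custom-columns='NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name'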

Create Test Data and Run Example

Process NYC taxi data to demonstrate EBS Dynamic PVC with shuffle operations.

1. Prepare Test Data

cd data-stacks/spark-on-eks/terraform/_local/

# Export S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to scripts directory and create test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION

Downloads NYC taxi data (1.1GB total) and uploads to S3
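
Optionally, confirm the upload finished before submitting the job; the prefix matches the input path used in the example manifest:

# Verify the taxi-trip input data is in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/input/ --recursive --human-readable --summarize | tail -n 3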

2. Execute Spark Job

# Navigate to examples directory
cd ../examples/

# Submit the EBS Dynamic PVC job
envsubst < ebs-storage-dynamic-pvc.yaml | kubectl apply -f -

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch

Expected output:

NAME                STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-ebs-pvc   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
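
If the application does not reach COMPLETED, the SparkApplication status and events usually point at the cause; these are generic spark-operator checks rather than anything specific to EBS PVCs:

# Inspect detailed status and recent events for the application
kubectl describe sparkapplication taxi-trip-ebs-pvc -n spark-team-a

# Or print just the application state
kubectl get sparkapplication taxi-trip-ebs-pvc -n spark-team-a \
  -o jsonpath='{.status.applicationState.state}{"\n"}'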

Verify Data and Pods

Monitor PVC Creation

# Watch PVC creation in real-time
kubectl get pvc -n spark-team-a --watch

# Expected PVCs
NAME                                              STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
taxi-trip-ebs-pvc-b64d669992344315-driver-pvc-0   Bound    pvc-e891b472-249f-44d9-a9ce-6ab4c3a9a488   100Gi      RWO            gp3            <unset>                 3m34s
taxi-trip-ebs-pvc-exec-1-pvc-0                    Bound    pvc-ae09b08b-8a5a-4892-a9ab-9d6ff2ceb6df   100Gi      RWO            gp3            <unset>                 114s
taxi-trip-ebs-pvc-exec-2-pvc-0                    Bound    pvc-7a2b4e76-5ab6-435e-989e-2978618a2877   100Gi      RWO            gp3            <unset>                 114s
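
To see PVC reuse in action, delete one executor pod while the job is still running; the replacement executor should mount the existing driver-owned PVC instead of provisioning a new volume. The pod name below is illustrative (take it from kubectl get pods), and, as noted in the manifest comments, a pod still in Terminating status is not yet reusable, so an extra PVC may appear briefly:

# Delete one executor pod mid-job (use an actual executor pod name from your run)
kubectl delete pod -n spark-team-a taxi-trip-ebs-pvc-exec-1

# Watch the PVCs: the replacement executor should bind to an existing claim rather than a new one
kubectl get pvc -n spark-team-a --watch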

Check Pod Status and Storage

# Check driver and executor pods
kubectl get pods -n spark-team-a -l sparkoperator.k8s.io/app-name=taxi-trip-ebs-pvc

# Check volume usage inside the driver pod (name set via spark.kubernetes.driver.pod.name)
kubectl exec -n spark-team-a taxi-trip-ebs-pvc -- df -h /data1

# View Spark application logs
kubectl logs -n spark-team-a -l spark-role=driver --follow

Verify Output Data

# Check processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/

# Verify event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/

Cleanup

# Delete the Spark application
kubectl delete sparkapplication taxi-trip-ebs-pvc -n spark-team-a

# Check whether any PVCs remain (driver-owned on-demand PVCs are garbage collected along with the driver pod)
kubectl get pvc -n spark-team-a

# Optional: Delete PVCs if no longer needed
kubectl delete pvc -n spark-team-a --all
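
If the spark-team-a namespace is shared, a more targeted alternative to --all is to delete only this example's PVCs by name pattern; the grep pattern below is an assumption based on the PVC names shown earlier:

# Delete only the PVCs created by this example
kubectl get pvc -n spark-team-a -o name | grep taxi-trip | xargs kubectl delete -n spark-team-a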

Benefits

  • Automatic PVC Management: No manual volume creation
  • Fault Tolerance: Shuffle data survives executor restarts
  • Cost Optimization: Dynamic sizing and reuse
  • Performance: Faster startup with PVC reuse

Storage Class Options

# GP3 - Better price/performance
storageClass: "gp3"

# IO1 - High IOPS workloads
storageClass: "io1"
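
If the default gp3 settings are not enough, you can create a dedicated StorageClass with provisioned IOPS and throughput and point the sparkConf storageClass options at it. The class name spark-shuffle-gp3 and the numbers below are illustrative; the type, iops, and throughput parameters are standard EBS CSI driver settings:

# Create a tuned gp3 storage class for shuffle-heavy jobs (illustrative values)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: spark-shuffle-gp3            # reference this via ...options.storageClass
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"                       # gp3 baseline is 3000 IOPS
  throughput: "500"                  # MiB/s; gp3 baseline is 125 MiB/s
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
EOF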

Next Steps