Spark with EBS Dynamic PVC Storage
Learn to use EBS Dynamic PVC for Spark shuffle storage with automatic provisioning and data recovery.
Architecture: PVC Reuse & Fault Tolerance
Key Benefits:
- 🎯 Driver Ownership: Driver pod owns all PVCs for centralized management
- ♻️ PVC Reuse: Failed executors reuse existing PVCs with preserved shuffle data
- ⚡ Faster Recovery: No volume provisioning delay during executor restart
- 💰 Cost Efficient: Reuses EBS volumes instead of creating new ones
PVC Reuse Flow
When an executor fails, the driver (which owns that executor's PVC) detects that the pod has been deleted, schedules a replacement executor, and mounts the existing PVC into it. Because the shuffle data already written to the volume is preserved, the new executor can recover it instead of recomputing it, and no new EBS volume needs to be provisioned.
Prerequisites
- Deploy Spark on EKS infrastructure: Infrastructure Setup
- EBS CSI controller running with a `gp2` or `gp3` storage class for dynamic volume creation
EBS CSI Requirement
This example requires the EBS CSI driver to dynamically create volumes for Spark jobs. Ensure your cluster has the EBS CSI controller deployed with appropriate storage classes.
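To confirm both requirements before submitting the job, check for the `ebs.csi.aws.com` CSIDriver object and an available storage class:
# The CSIDriver object exists once the EBS CSI driver is installed
kubectl get csidriver ebs.csi.aws.com
# A gp2 or gp3 StorageClass must be available for dynamic provisioning
kubectl get storageclass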
What is Shuffle Storage in Spark?
Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
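To see this in practice, you can look inside an executor's `spark.local.dir` (mounted at `/data1` in the example below) while a job is running; Spark writes shuffle files under `blockmgr-*` directories there. A minimal sketch, assuming the example job from this guide is running in the `spark-team-a` namespace:
# Pick one running executor from the example job (names differ per run)
EXECUTOR=$(kubectl get pods -n spark-team-a -l spark-role=executor -o name | head -n 1)
# List the block manager directories and any shuffle files under the local dir
kubectl exec -n spark-team-a $EXECUTOR -- sh -c 'ls /data1 && find /data1 -name "shuffle_*" | head'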
Spark Shuffle Storage Options
| Storage Type | Performance | Cost | Use Case |
|---|---|---|---|
| NVMe SSD Instances | 🔥 Very High | 💰 High | Maximum performance workloads |
| EBS Dynamic PVC | ⚡ High | 💰 Medium | Featured - Production fault tolerance |
| EBS Node Storage | 📊 Medium | 💵 Medium | Shared volume per node |
| FSx for Lustre | 📊 Medium | 💵 Low | Parallel filesystem for HPC |
| S3 Express + Mountpoint | 📊 Medium | 💵 Low | Very large datasets |
| Remote Shuffle (Celeborn) | ⚡ High | 💰 Medium | Resource disaggregation |
Benefits: Performance & Cost
- NVMe: Fastest local SSD storage, highest cost per GB
- EBS Dynamic PVC: Balance of performance and cost with fault tolerance
- EBS Node Storage: Cost-effective shared volumes
- FSx/S3 Express: Cost-optimized for large-scale processing
Example Code
View the complete configuration:
📄 Complete EBS Dynamic PVC Configuration
examples/ebs-storage-dynamic-pvc.yaml
# Pre-requisite before running this job
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with your S3 bucket created by this example (Check Terraform outputs)
# 3/ execute taxi-trip-execute.sh
# This example supports the following features
# Support shuffle data recovery on the reused PVCs (SPARK-35593)
# Support driver-owned on-demand PVC (SPARK-35182)
# WARNING: the spark-operator cluster role is missing a 'persistentvolumeclaims' permission. Ensure you add this permission to the spark-operator cluster role
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: "taxi-trip-ebs-pvc"
namespace: spark-team-a
labels:
app: "taxi-trip-ebs-pvc"
queue: root.test
spec:
# To create Ingress object for Spark driver.
# Ensure Spark Operator Helm Chart deployed with Ingress enabled to use this feature
# sparkUIOptions:
# servicePort: 4040
# servicePortName: taxi-trip-ui-svc
# serviceType: ClusterIP
# ingressAnnotations:
# kubernetes.io/ingress.class: nginx
# nginx.ingress.kubernetes.io/use-regex: "true"
type: Python
sparkVersion: "3.5.3"
mode: cluster
image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
imagePullPolicy: IfNotPresent
mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # MainFile is the path to a bundled JAR, Python, or R file of the application
arguments:
- "s3a://$S3_BUCKET/taxi-trip/input/"
- "s3a://$S3_BUCKET/taxi-trip/output/"
sparkConf:
"spark.app.name": "taxi-trip-ebs-pvc"
"spark.kubernetes.driver.pod.name": "taxi-trip-ebs-pvc"
"spark.kubernetes.executor.podNamePrefix": "taxi-trip-ebs-pvc"
"spark.local.dir": "/data1"
"spark.speculation": "false"
"spark.network.timeout": "2400"
"spark.hadoop.fs.s3a.connection.timeout": "1200000"
"spark.hadoop.fs.s3a.path.style.access": "true"
"spark.hadoop.fs.s3a.connection.maximum": "200"
"spark.hadoop.fs.s3a.fast.upload": "true"
"spark.hadoop.fs.s3a.readahead.range": "256K"
"spark.hadoop.fs.s3a.input.fadvise": "random"
"spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
"spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2 https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
"spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
# Spark Event logs
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
"spark.eventLog.rolling.enabled": "true"
"spark.eventLog.rolling.maxFileSize": "64m"
# "spark.history.fs.eventLog.rolling.maxFilesToRetain": 100
# Expose Spark metrics for Prometheus
"spark.ui.prometheus.enabled": "true"
"spark.executor.processTreeMetrics.enabled": "true"
"spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
"spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
"spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
# EBS Dynamic PVC Config
# You can mount a dynamically-created persistent volume claim per executor by using OnDemand as a claim name and storageClass and sizeLimit options like the following. This is useful in case of Dynamic Allocation.
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
# Support shuffle data recovery on the reused PVCs (SPARK-35593)
# If true, driver pod becomes the owner of on-demand persistent volume claims instead of the executor pods
"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
# If true, driver pod tries to reuse driver-owned on-demand persistent volume claims of the deleted executor pods if exists.
# This can be useful to reduce executor pod creation delay by skipping persistent volume creations.
# Note that a pod in `Terminating` pod status is not a deleted pod by definition and its resources including persistent volume claims are not reusable yet. Spark will create new persistent volume claims when there exists no reusable one.
# In other words, the total number of persistent volume claims can be larger than the number of running executors sometimes.
# This config requires spark.kubernetes.driver.ownPersistentVolumeClaim=true.
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true" #
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
driver:
initContainers:
- name: volume-permissions
image: public.ecr.aws/y4g4v0z7/busybox
command: [ 'sh', '-c', 'chown -R 185 /data1' ]
volumeMounts:
- mountPath: "/data1"
name: "spark-local-dir-1"
cores: 1
coreLimit: "1200m"
memory: "4g"
memoryOverhead: "4g"
serviceAccount: spark-team-a
labels:
version: 3.5.3
nodeSelector:
node.kubernetes.io/workload-type: "compute-optimized-x86"
karpenter.sh/capacity-type: "on-demand"
tolerations:
- key: "spark-compute-optimized"
operator: "Exists"
effect: "NoSchedule"
executor:
initContainers:
- name: volume-permissions
image: public.ecr.aws/y4g4v0z7/busybox
command: [ 'sh', '-c', 'chown -R 185 /data1' ]
volumeMounts:
- mountPath: "/data1"
name: "spark-local-dir-1"
cores: 1
coreLimit: "1200m"
instances: 2
memory: "4g"
memoryOverhead: "4g"
serviceAccount: spark-team-a
labels:
version: 3.5.3
nodeSelector:
node.kubernetes.io/workload-type: "compute-optimized-x86"
tolerations:
- key: "spark-compute-optimized"
operator: "Exists"
effect: "NoSchedule"
EBS Dynamic PVC Configuration
Key configuration for dynamic PVC provisioning:
Essential Dynamic PVC Settings
sparkConf:
# Dynamic PVC creation - Driver
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
# Dynamic PVC creation - Executor
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
# PVC ownership and reuse for fault tolerance
"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"
Features:
- `OnDemand`: Automatically creates a PVC per pod
- `gp3`: EBS GP3 storage class (default, better price/performance than GP2)
- `100Gi`: Storage size per volume (sized for this example workload)
- Driver ownership enables PVC reuse for fault tolerance
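Once the job is running, you can confirm the driver owns the claims by inspecting each PVC's ownerReferences (a minimal sketch; run it after submitting the job in the next section):
# Each on-demand PVC should list the driver pod as its owner
kubectl get pvc -n spark-team-a -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'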
Create Test Data and Run Example
Process NYC taxi data to demonstrate EBS Dynamic PVC with shuffle operations.
1. Prepare Test Data
cd data-stacks/spark-on-eks/terraform/_local/
# Export S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)
# Navigate to scripts directory and create test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION
Downloads NYC taxi data (1.1GB total) and uploads to S3
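Optionally, confirm the input data is in place before submitting the job:
# Verify the taxi-trip input files were uploaded
aws s3 ls s3://$S3_BUCKET/taxi-trip/input/ --human-readable --summarize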
2. Execute Spark Job
# Navigate to examples directory
cd ../examples/
# Submit the EBS Dynamic PVC job
envsubst < ebs-storage-dynamic-pvc.yaml | kubectl apply -f -
# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch
Expected output:
NAME                STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-ebs-pvc   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
Verify Data and Pods
Monitor PVC Creation
# Watch PVC creation in real-time
kubectl get pvc -n spark-team-a --watch
# Expected PVCs
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
taxi-trip-b64d669992344315-driver-pvc-0 Bound pvc-e891b472-249f-44d9-a9ce-6ab4c3a9a488 100Gi RWO gp3 <unset> 3m34s
taxi-trip-exec-1-pvc-0 Bound pvc-ae09b08b-8a5a-4892-a9ab-9d6ff2ceb6df 100Gi RWO gp3 <unset> 114s
taxi-trip-exec-2-pvc-0 Bound pvc-7a2b4e76-5ab6-435e-989e-2978618a2877 100Gi RWO gp3 <unset> 114s
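To see PVC reuse in action, delete one executor pod while the job is still running; once the old pod is fully gone, the driver attaches the replacement executor to the existing claim instead of provisioning a new volume. A minimal sketch (pod names differ per run):
# Simulate an executor failure by deleting one running executor
VICTIM=$(kubectl get pods -n spark-team-a -l spark-role=executor -o name | head -n 1)
kubectl delete -n spark-team-a $VICTIM
# The replacement executor comes up against the same claims:
# the PVC count stays the same and the volumes remain Bound
kubectl get pvc -n spark-team-a
kubectl get pods -n spark-team-a -l spark-role=executor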
Check Pod Status and Storage
# Check driver and executor pods
kubectl get pods -n spark-team-a -l sparkoperator.k8s.io/app-name=taxi-trip-ebs-pvc
# Check volume usage inside pods
kubectl exec -n spark-team-a $(kubectl get pods -n spark-team-a -l spark-role=driver -o name) -- df -h /data1
# View Spark application logs
kubectl logs -n spark-team-a -l spark-role=driver --follow
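You can also trace a claim back to its underlying EBS volume; for the EBS CSI driver, the PersistentVolume's volumeHandle is the EBS volume ID. Replace <pvc-name> with one of the claims listed above:
# Map a PVC to its PersistentVolume and the backing EBS volume
PV=$(kubectl get pvc <pvc-name> -n spark-team-a -o jsonpath='{.spec.volumeName}')
VOL_ID=$(kubectl get pv $PV -o jsonpath='{.spec.csi.volumeHandle}')
aws ec2 describe-volumes --volume-ids $VOL_ID --region $REGION --query 'Volumes[0].{State:State,Size:Size,Type:VolumeType}'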
Verify Output Data
# Check processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/
# Verify event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/
Cleanup
# Delete the Spark application
kubectl delete sparkapplication taxi-trip-ebs-pvc -n spark-team-a
# Check whether any PVCs remain (driver-owned PVCs are garbage-collected once the driver pod is removed)
kubectl get pvc -n spark-team-a
# Optional: Delete PVCs if no longer needed
kubectl delete pvc -n spark-team-a --all
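After cleanup, you can confirm the backing PersistentVolumes (and their EBS volumes) have been released; with the default Delete reclaim policy this should return no entries once deletion completes:
# Look for any PersistentVolumes still bound to claims in the namespace
kubectl get pv | grep spark-team-a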
Benefits
- Automatic PVC Management: No manual volume creation
- Fault Tolerance: Shuffle data survives executor restarts
- Cost Optimization: Dynamic sizing and reuse
- Performance: Faster startup with PVC reuse
Storage Class Options
# GP3 - Better price/performance
storageClass: "gp3"
# IO1 - High IOPS workloads
storageClass: "io1"
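If your cluster does not already have a gp3 StorageClass, a minimal one for the EBS CSI driver looks like the sketch below (the encrypted parameter is optional and shown only as an example):
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
EOF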
Next Steps
- NVMe Instance Storage - High-performance local SSD