Spark with EBS Dynamic PVC Storage
Learn to use EBS Dynamic PVC for Spark shuffle storage with automatic provisioning and data recovery.
Architecture: PVC Reuse & Fault Tolerance
Key Benefits:
- 🎯 Driver Ownership: Driver pod owns all PVCs for centralized management
- ♻️ PVC Reuse: Failed executors reuse existing PVCs with preserved shuffle data
- ⚡ Faster Recovery: No volume provisioning delay during executor restart
- 💰 Cost Efficient: Reuses EBS volumes instead of creating new ones
PVC Reuse Flow
When an executor fails, the driver (which owns that executor's PVC) detects that the pod has been deleted, schedules a replacement executor, and mounts the existing PVC into it. Because the shuffle data already written to the volume is preserved, the new executor can recover it instead of recomputing it, and no new EBS volume needs to be provisioned.
Prerequisites
- Deploy Spark on EKS infrastructure: Infrastructure Setup
- EBS CSI controller running with a `gp2` or `gp3` storage class for dynamic volume creation
EBS CSI Requirement
This example requires the EBS CSI driver to dynamically create volumes for Spark jobs. Ensure your cluster has the EBS CSI controller deployed with appropriate storage classes.
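To confirm both requirements before submitting the job, check for the `ebs.csi.aws.com` CSIDriver object and an available storage class:
# The CSIDriver object exists once the EBS CSI driver is installed
kubectl get csidriver ebs.csi.aws.com
# A gp2 or gp3 StorageClass must be available for dynamic provisioning
kubectl get storageclass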
What is Shuffle Storage in Spark?
Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
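To see this in practice, you can look inside an executor's `spark.local.dir` (mounted at `/data1` in the example below) while a job is running; Spark writes shuffle files under `blockmgr-*` directories there. A minimal sketch, assuming the example job from this guide is running in the `spark-team-a` namespace:
# Pick one running executor from the example job (names differ per run)
EXECUTOR=$(kubectl get pods -n spark-team-a -l spark-role=executor -o name | head -n 1)
# List the block manager directories and any shuffle files under the local dir
kubectl exec -n spark-team-a $EXECUTOR -- sh -c 'ls /data1 && find /data1 -name "shuffle_*" | head'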
Spark Shuffle Storage Options
| Storage Type | Performance | Cost | Use Case |
|---|---|---|---|
| NVMe SSD Instances | 🔥 Very High | 💰 High | Maximum performance workloads |
| EBS Dynamic PVC | ⚡ High | 💰 Medium | Featured - Production fault tolerance |
| EBS Node Storage | 📊 Medium | 💵 Medium | Shared volume per node |
| FSx for Lustre | 📊 Medium | 💵 Low | Parallel filesystem for HPC |
| S3 Express + Mountpoint | 📊 Medium | 💵 Low | Very large datasets |
| Remote Shuffle (Celeborn) | ⚡ High | 💰 Medium | Resource disaggregation |
Benefits: Performance & Cost
- NVMe: Fastest local SSD storage, highest cost per GB
- EBS Dynamic PVC: Balance of performance and cost with fault tolerance
- EBS Node Storage: Cost-effective shared volumes
- FSx/S3 Express: Cost-optimized for large-scale processing
Example Code
View the complete configuration:
📄 Complete EBS Dynamic PVC Configuration
examples/ebs-storage-dynamic-pvc.yaml
# Pre-requisite before running this job
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with your S3 bucket created by this example (Check Terraform outputs)
# 3/ execute taxi-trip-execute.sh
# This example supports the following features
# Support shuffle data recovery on the reused PVCs (SPARK-35593)
# Support driver-owned on-demand PVC (SPARK-35182)
# WARNING: the spark-operator cluster role is missing a 'persistentvolumeclaims' permission. Ensure you add this permission to the spark-operator cluster role
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: "taxi-trip-ebs-pvc"
namespace: spark-team-a
labels:
app: "taxi-trip-ebs-pvc"
queue: root.test
spec:
# To create Ingress object for Spark driver.
# Ensure Spark Operator Helm Chart deployed with Ingress enabled to use this feature
# sparkUIOptions:
# servicePort: 4040
# servicePortName: taxi-trip-ui-svc
# serviceType: ClusterIP
# ingressAnnotations:
# kubernetes.io/ingress.class: nginx
# nginx.ingress.kubernetes.io/use-regex: "true"
type: Python
sparkVersion: "3.5.3"
mode: cluster
image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
imagePullPolicy: IfNotPresent
mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # MainFile is the path to a bundled JAR, Python, or R file of the application
arguments:
- "s3a://$S3_BUCKET/taxi-trip/input/"
- "s3a://$S3_BUCKET/taxi-trip/output/"
sparkConf:
"spark.app.name": "taxi-trip-ebs-pvc"
"spark.kubernetes.driver.pod.name": "taxi-trip-ebs-pvc"
"spark.kubernetes.executor.podNamePrefix": "taxi-trip-ebs-pvc"
"spark.local.dir": "/data1"
"spark.speculation": "false"
"spark.network.timeout": "2400"
"spark.hadoop.fs.s3a.connection.timeout": "1200000"
"spark.hadoop.fs.s3a.path.style.access": "true"
"spark.hadoop.fs.s3a.connection.maximum": "200"
"spark.hadoop.fs.s3a.fast.upload": "true"
"spark.hadoop.fs.s3a.readahead.range": "256K"
"spark.hadoop.fs.s3a.input.fadvise": "random"
"spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
"spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2 https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
"spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
# Spark Event logs
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
"spark.eventLog.rolling.enabled": "true"
"spark.eventLog.rolling.maxFileSize": "64m"
# "spark.history.fs.eventLog.rolling.maxFilesToRetain": 100
# Expose Spark metrics for Prometheus
"spark.ui.prometheus.enabled": "true"
"spark.executor.processTreeMetrics.enabled": "true"
"spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
"spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
"spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
# EBS Dynamic PVC Config
# You can mount a dynamically-created persistent volume claim per executor by using OnDemand as a claim name and storageClass and sizeLimit options like the following. This is useful in case of Dynamic Allocation.
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly": "false"
# Support shuffle data recovery on the reused PVCs (SPARK-35593)
# If true, driver pod becomes the owner of on-demand persistent volume claims instead of the executor pods
"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
# If true, driver pod tries to reuse driver-owned on-demand persistent volume claims of the deleted executor pods if exists.
# This can be useful to reduce executor pod creation delay by skipping persistent volume creations.
# Note that a pod in `Terminating` pod status is not a deleted pod by definition and its resources including persistent volume claims are not reusable yet. Spark will create new persistent volume claims when there exists no reusable one.
# In other words, the total number of persistent volume claims can be larger than the number of running executors sometimes.
# This config requires spark.kubernetes.driver.ownPersistentVolumeClaim=true.
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true" #
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
driver:
initContainers:
- name: volume-permissions
image: public.ecr.aws/y4g4v0z7/busybox
command: [ 'sh', '-c', 'chown -R 185 /data1' ]
volumeMounts:
- mountPath: "/data1"
name: "spark-local-dir-1"
cores: 1
coreLimit: "1200m"
memory: "4g"
memoryOverhead: "4g"
serviceAccount: spark-team-a
labels:
version: 3.5.3
nodeSelector:
node.kubernetes.io/workload-type: "compute-optimized-x86"
karpenter.sh/capacity-type: "on-demand"
tolerations:
- key: "spark-compute-optimized"
operator: "Exists"
effect: "NoSchedule"
executor:
initContainers:
- name: volume-permissions
image: public.ecr.aws/y4g4v0z7/busybox
command: [ 'sh', '-c', 'chown -R 185 /data1' ]
volumeMounts:
- mountPath: "/data1"
name: "spark-local-dir-1"
cores: 1
coreLimit: "1200m"
instances: 2
memory: "4g"
memoryOverhead: "4g"
serviceAccount: spark-team-a
labels:
version: 3.5.3
nodeSelector:
node.kubernetes.io/workload-type: "compute-optimized-x86"
tolerations:
- key: "spark-compute-optimized"
operator: "Exists"
effect: "NoSchedule"
EBS Dynamic PVC Configuration
Key configuration for dynamic PVC provisioning:
Essential Dynamic PVC Settings
sparkConf:
# Dynamic PVC creation - Driver
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
# Dynamic PVC creation - Executor
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName": "OnDemand"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass": "gp3"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit": "100Gi"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path": "/data1"
# PVC ownership and reuse for fault tolerance
"spark.kubernetes.driver.ownPersistentVolumeClaim": "true"
"spark.kubernetes.driver.reusePersistentVolumeClaim": "true"
Features:
- `OnDemand`: Automatically creates a PVC per pod
- `gp3`: EBS GP3 storage class (default, better price/performance than GP2)
- `100Gi`: Storage size per volume (sized for this example workload)
- Driver ownership enables PVC reuse for fault tolerance
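Once the job is running, you can confirm the driver owns the claims by inspecting each PVC's ownerReferences (a minimal sketch; run it after submitting the job in the next section):
# Each on-demand PVC should list the driver pod as its owner
kubectl get pvc -n spark-team-a -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'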
Create Test Data and Run Example
Process NYC taxi data to demonstrate EBS Dynamic PVC with shuffle operations.
1. Prepare Test Data
cd data-stacks/spark-on-eks/terraform/_local/
# Export S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)
# Navigate to scripts directory and create test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION
Downloads NYC taxi data (1.1GB total) and uploads to S3
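Optionally, confirm the input data is in place before submitting the job:
# Verify the taxi-trip input files were uploaded
aws s3 ls s3://$S3_BUCKET/taxi-trip/input/ --human-readable --summarize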
2. Execute Spark Job
# Navigate to examples directory
cd ../examples/
# Submit the EBS Dynamic PVC job
envsubst < ebs-storage-dynamic-pvc.yaml | kubectl apply -f -
# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch
Expected output:
NAME                STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-ebs-pvc   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
Verify Data and Pods
Monitor PVC Creation
# Watch PVC creation in real-time
kubectl get pvc -n spark-team-a --watch
# Expected PVCs
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS VOLUMEATTRIBUTESCLASS AGE
taxi-trip-b64d669992344315-driver-pvc-0 Bound pvc-e891b472-249f-44d9-a9ce-6ab4c3a9a488 100Gi RWO gp3 <unset> 3m34s
taxi-trip-exec-1-pvc-0 Bound pvc-ae09b08b-8a5a-4892-a9ab-9d6ff2ceb6df 100Gi RWO gp3 <unset> 114s
taxi-trip-exec-2-pvc-0 Bound pvc-7a2b4e76-5ab6-435e-989e-2978618a2877 100Gi RWO gp3 <unset> 114s
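To see PVC reuse in action, delete one executor pod while the job is still running; once the old pod is fully gone, the driver attaches the replacement executor to the existing claim instead of provisioning a new volume. A minimal sketch (pod names differ per run):
# Simulate an executor failure by deleting one running executor
VICTIM=$(kubectl get pods -n spark-team-a -l spark-role=executor -o name | head -n 1)
kubectl delete -n spark-team-a $VICTIM
# The replacement executor comes up against the same claims:
# the PVC count stays the same and the volumes remain Bound
kubectl get pvc -n spark-team-a
kubectl get pods -n spark-team-a -l spark-role=executor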
Check Pod Status and Storage
# Check driver and executor pods
kubectl get pods -n spark-team-a -l sparkoperator.k8s.io/app-name=taxi-trip-ebs-pvc
# Check volume usage inside pods
kubectl exec -n spark-team-a $(kubectl get pods -n spark-team-a -l spark-role=driver -o name) -- df -h /data1
# View Spark application logs
kubectl logs -n spark-team-a -l spark-role=driver --follow
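You can also trace a claim back to its underlying EBS volume; for the EBS CSI driver, the PersistentVolume's volumeHandle is the EBS volume ID. Replace <pvc-name> with one of the claims listed above:
# Map a PVC to its PersistentVolume and the backing EBS volume
PV=$(kubectl get pvc <pvc-name> -n spark-team-a -o jsonpath='{.spec.volumeName}')
VOL_ID=$(kubectl get pv $PV -o jsonpath='{.spec.csi.volumeHandle}')
aws ec2 describe-volumes --volume-ids $VOL_ID --region $REGION --query 'Volumes[0].{State:State,Size:Size,Type:VolumeType}'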
Verify Output Data
# Check processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/
# Verify event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/
Cleanup
# Delete the Spark application
kubectl delete sparkapplication taxi-trip-ebs-pvc -n spark-team-a
# Check whether any PVCs remain (driver-owned PVCs are garbage-collected once the driver pod is removed)
kubectl get pvc -n spark-team-a
# Optional: Delete PVCs if no longer needed
kubectl delete pvc -n spark-team-a --all
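After cleanup, you can confirm the backing PersistentVolumes (and their EBS volumes) have been released; with the default Delete reclaim policy this should return no entries once deletion completes:
# Look for any PersistentVolumes still bound to claims in the namespace
kubectl get pv | grep spark-team-a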
Benefits
- Automatic PVC Management: No manual volume creation
- Fault Tolerance: Shuffle data survives executor restarts
- Cost Optimization: Dynamic sizing and reuse
- Performance: Faster startup with PVC reuse
Storage Class Options
# GP3 - Better price/performance
storageClass: "gp3"
# IO1 - High IOPS workloads
storageClass: "io1"
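If your cluster does not already have a gp3 StorageClass, a minimal one for the EBS CSI driver looks like the sketch below (the encrypted parameter is optional and shown only as an example):
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
EOF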
Next Steps
- NVMe Instance Storage - High-performance local SSD