Spark with EBS Node-Level Storage

Learn how to use a shared EBS volume per node for Spark shuffle storage, a cost-effective alternative to per-pod PVCs.

Prerequisites

  • Deploy Spark on EKS infrastructure: Infrastructure Setup
  • An existing 200Gi GP3 EBS volume on each node with a dedicated Spark directory (/mnt/spark-local)
Node Storage Considerations

This approach shares one EBS volume per node among all Spark pods scheduled on that node. While cost-effective, it can suffer from noisy neighbor issues when multiple workloads compete for the same storage I/O.
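
To see the sharing in practice, you can check how many Spark pods land on each node; every pod on a node writes its shuffle data to that node's volume. A minimal check, assuming the taxi-trip example from later in this guide is running in the spark-team-a namespace:

# Count Spark pods per node; all pods on a node share that node's EBS-backed storage
kubectl get pods -n spark-team-a -o wide --no-headers | awk '{print $7}' | sort | uniq -c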

Architecture: Shared EBS Volume per Node

Key Benefits:

  • 💰 Cost Effective: One EBS volume per node vs per pod
  • ⚡ Higher Performance: Direct node storage access
  • 🔄 Simplified Management: No PVC lifecycle complexity
  • ⚠️ Trade-off: Potential noisy neighbor issues

What is Shuffle Storage in Spark?

Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
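
To make this concrete, you can peek at the shuffle files Spark writes under spark.local.dir while a shuffle-heavy stage (such as a join) is running. This is an optional check and assumes the taxi-trip executor pod from the example below is up:

# List shuffle index/data files under the block manager directories in spark.local.dir (/data1)
kubectl exec -n spark-team-a taxi-trip-exec-1 -- sh -c 'find /data1 -name "shuffle_*" | head'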

Spark Shuffle Storage Options

| Storage Type | Performance | Cost | Use Case |
|---|---|---|---|
| NVMe SSD Instances | 🔥 Very High | 💰 High | Maximum performance workloads |
| EBS Node Storage | ⚡ High | 💵 Medium | Featured - Cost-effective shared storage |
| EBS Dynamic PVC | 📊 Medium | 💰 Medium | Per-pod isolation and fault tolerance |
| FSx for Lustre | 📊 Medium | 💵 Low | Parallel filesystem for HPC |
| S3 Express + Mountpoint | 📊 Medium | 💵 Low | Very large datasets |
| Remote Shuffle (Celeborn) | ⚡ High | 💰 Medium | Resource disaggregation |

Benefits: Performance & Cost

  • EBS Node Storage: A balance of performance and cost with shared volumes
  • Higher throughput: Direct hostPath access, with no PVC provisioning or volume-attach latency per pod
  • Cost optimization: Fewer EBS volumes than the per-pod approach (for example, two nodes running ten executors need two node volumes instead of ten per-executor PVCs)

Example Code

View the complete configuration:

📄 Complete EBS Node Storage Configuration
examples/ebs-node-storage.yaml
# Pre-requisite before running this job
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with your S3 bucket created by this example (Check Terraform outputs)
# 3/ execute taxi-trip-execute.sh

# This example demonstrates EBS Node-Level Storage features
# Shared EBS volume per node (mounted at /mnt/spark-local)
# Cost-effective alternative to per-pod PVCs
# Higher performance than individual PVCs but potential noisy neighbor issues

---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip"
  namespace: spark-team-a
  labels:
    app: "taxi-trip"
    applicationId: "taxi-trip-ebs"
    queue: root.test
spec:
  # To create Ingress object for Spark driver.
  # Ensure Spark Operator Helm Chart deployed with Ingress enabled to use this feature
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py"  # MainFile is the path to a bundled JAR, Python, or R file of the application
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip"
    "spark.kubernetes.driver.pod.name": "taxi-trip"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"  # AWS SDK V2 https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"

    # Spark Event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": 100

    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"

    # EBS Node-Level Storage Configuration
    # Use node-level EBS volume mounted at /mnt/spark-local
    # This provides shared storage per node, reducing costs compared to per-pod PVCs
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/spark-local"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/spark-local"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20

  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "4g"
    memoryOverhead: "4g"
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.sh/capacity-type: "on-demand"
    tolerations:
      - key: "spark-compute-optimized"
        operator: "Exists"
        effect: "NoSchedule"
  executor:
    cores: 1
    coreLimit: "1200m"
    instances: 2
    memory: "4g"
    memoryOverhead: "4g"
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
    tolerations:
      - key: "spark-compute-optimized"
        operator: "Exists"
        effect: "NoSchedule"

EBS Node Storage Configuration

Key configuration for shared node-level storage:

Essential Node Storage Settings
sparkConf:
  # Node-level EBS volume - Driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/spark-local"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

  # Node-level EBS volume - Executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/spark-local"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

Features:

  • hostPath: Uses a node-level directory on the node's EBS-backed root volume
  • /mnt/spark-local: Shared directory per node (200Gi GP3 root volume)
  • Directory: Requires the directory to already exist on the node (see the preparation sketch below); DirectoryOrCreate would let the kubelet create it instead
  • All Spark pods on the same node share this storage
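
Because the hostPath type is Directory, the kubelet will not create /mnt/spark-local; it must already exist on every node before pods are scheduled. How you create it depends on your node provisioning; a minimal sketch, assuming you can add commands to the node bootstrap (for example via Karpenter EC2NodeClass userData), is shown below. This is an illustration, not part of the infrastructure module:

# Assumed node bootstrap snippet: create the shared Spark scratch directory on the
# node's 200Gi GP3 root volume and make it writable by non-root Spark containers
mkdir -p /mnt/spark-local
chmod 1777 /mnt/spark-local   # world-writable with sticky bit, like /tmp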

Create Test Data and Run Example

Process NYC taxi data to demonstrate EBS node-level storage with shared volumes.

1. Prepare Test Data

cd data-stacks/spark-on-eks/terraform/_local/

# Export S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to scripts directory and create test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION

Downloads NYC taxi data (1.1GB total) and uploads to S3
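
Before submitting the job, you can confirm the upload landed where the SparkApplication expects it:

# Verify the input dataset and the PySpark script are in place
aws s3 ls s3://$S3_BUCKET/taxi-trip/input/ --human-readable --summarize
aws s3 ls s3://$S3_BUCKET/taxi-trip/scripts/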

2. Execute Spark Job

# Navigate to example directory
cd ../examples/

# Submit the EBS Node Storage job
envsubst < ebs-node-storage.yaml | kubectl apply -f -

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch

Expected output:

NAME        STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
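
If the application sits in SUBMITTED or ends up FAILED instead, the operator's status and events usually explain why (for example, a missing /mnt/spark-local directory on the node or an S3 permissions problem):

# Inspect application status and recent events reported by the Spark Operator
kubectl describe sparkapplication taxi-trip -n spark-team-a
kubectl get events -n spark-team-a --sort-by=.lastTimestamp | tail -20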

Verify Data and Storage

Monitor Node Storage Usage

# Check which nodes have Spark pods
kubectl get pods -n spark-team-a -o wide

# Launch a node debug pod to check storage usage (the node filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=busybox -- df -h /host/

# Check directory structure on nodes
kubectl debug node/<node-name> -it --image=busybox -- ls -la /host/mnt/spark-local/

Check Pod Status and Storage

# Check driver and executor pods
kubectl get pods -n spark-team-a -l app=taxi-trip

# Verify node storage is mounted correctly (/dev/nvme0n1p1 at /data1)
kubectl exec -n spark-team-a taxi-trip-exec-1 -- df -h

# Expected output:
# /dev/nvme0n1p1 200G 7.9G 193G 4% /data1

# Check Spark shuffle data directories (multiple blockmgr per node)
kubectl exec -n spark-team-a taxi-trip-exec-1 -- ls -la /data1/

# Expected output shows shared storage with multiple block managers:
# drwxr-xr-x. 22 spark spark 16384 Sep 28 22:09 blockmgr-7c0ac908-26a3-4395-8a8f-2221b4d5d7c3
# drwxr-xr-x. 13 spark spark 116 Sep 28 22:09 blockmgr-9ed9c2fd-53e1-4337-8a68-9a48e1e63d5f
# drwxr-xr-x. 13 spark spark 116 Sep 28 22:09 blockmgr-ecc9fa35-a82e-4486-85fe-c8ef963d6eb7

# Verify shared storage between pods on same node
kubectl exec -n spark-team-a taxi-trip-driver -- ls -la /data1/

# View Spark application logs
kubectl logs -n spark-team-a -l spark-role=driver --follow

Verify Output Data

# Check processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/

# Verify event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/

Node Storage Considerations

Advantages

  • Cost Savings: ~70% cost reduction vs per-pod PVCs
  • Higher Performance: Direct node storage access
  • Simplified Operations: No PVC management overhead
  • Better Resource Utilization: Shared storage pool per node

Disadvantages & Mitigation

  • Noisy Neighbors: Multiple pods compete for same storage I/O
    • Mitigation: Use compute-optimized instances with higher IOPS
  • No Isolation: Pods can see each other's temporary data
    • Mitigation: Configure proper directory permissions
  • Storage Sizing: Must pre-size for all workloads on node
    • Mitigation: Monitor usage and adjust volume size (see the monitoring sketch below)
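
A simple way to watch the shared directory, reusing the node debug approach shown earlier (<node-name> is a placeholder):

# Spot-check free space and per-directory usage of the shared Spark scratch space on a node
kubectl debug node/<node-name> -it --image=busybox -- sh -c \
  'df -h /host/mnt/spark-local && du -sh /host/mnt/spark-local/*'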

When to Use Node Storage

Good for:

  • Cost-sensitive workloads
  • Predictable I/O patterns
  • Trusted multi-tenant environments
  • Development/testing environments

Avoid for:

  • Security-sensitive workloads requiring isolation
  • Unpredictable I/O bursts
  • Critical production workloads needing guarantees

Storage Class Options

# GP3 - Better price/performance (default)
volumeType: gp3

# IO1 - High IOPS workloads
volumeType: io1

# ST1 - Throughput-optimized
volumeType: st1
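
For gp3 node volumes, IOPS and throughput can be raised independently of volume size. A hedged example using the AWS CLI; the volume ID is a placeholder for the EBS volume backing /mnt/spark-local, and the values are illustrative:

# Raise gp3 performance on an existing node volume without resizing it
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type gp3 --iops 6000 --throughput 500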

Cleanup

# Delete the Spark application
kubectl delete sparkapplication taxi-trip -n spark-team-a

# Node storage persists until node termination
# Data is automatically cleaned up when nodes are replaced
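
If you want to reclaim space before a node is recycled, leftover Spark scratch directories can be removed manually. This is optional; <node-name> is a placeholder, and it assumes no other Spark jobs are currently using that node:

# Remove leftover block manager and temp directories from the shared node volume
kubectl debug node/<node-name> -it --image=busybox -- sh -c \
  'rm -rf /host/mnt/spark-local/blockmgr-* /host/mnt/spark-local/spark-*'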

Next Steps