Spark with NVMe Instance Storage
Achieve maximum Apache Spark performance by using the NVMe SSDs attached directly to the instances for shuffle storage. Direct local access eliminates network I/O between compute and storage, making this the highest-performance shuffle option.
Prerequisites
- Deploy Spark on EKS infrastructure: Infrastructure Setup
- Latest generation instances with NVMe storage (c6id, c7id, r6id, r7id, m6id, m7id, i4i families)
- Karpenter RAID0 policy automatically formats and mounts available NVMe storage
NVMe instance storage provides the highest I/O performance for Spark workloads: executors read and write directly to local SSDs attached to the instance, with no network hop between compute and storage.
Architecture: Direct NVMe SSD Access
Key Benefits:
- 🔥 Maximum Performance: 500K+ IOPS vs 16K IOPS (EBS gp3)
- ⚡ Zero Network Latency: Direct local storage access
- 💰 Cost Included: Storage cost is part of the instance price
- 🚀 Auto-Configuration: Karpenter RAID0 policy handles setup
What is Shuffle Storage in Spark?
Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
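Concretely, shuffle files land under spark.local.dir, which this example points at the NVMe mount /data1. While a stage with a wide dependency (for example a groupBy) is running, you can watch those files accumulate on an executor; the pod name below follows this example's naming and is only illustrative.

```bash
# Watch shuffle data accumulate under spark.local.dir (/data1 here).
# blockmgr-* directories hold the shuffle blocks written by each executor.
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- \
  sh -c 'du -sh /data1/blockmgr-* 2>/dev/null; find /data1 -name "shuffle_*" | head'
```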
Spark Shuffle Storage Options
| Storage Type | Performance | Cost | Use Case |
|---|---|---|---|
| NVMe SSD Instances | 🔥 Very High | 💰 High | Featured - maximum-performance workloads |
| EBS Dynamic PVC | ⚡ High | 💰 Medium | Production fault tolerance |
| EBS Node Storage | 📊 Medium | 💵 Medium | Cost-effective shared storage |
| FSx for Lustre | 📊 Medium | 💵 Low | Parallel filesystem for HPC |
| S3 Express + Mountpoint | 📊 Medium | 💵 Low | Very large datasets |
| Remote Shuffle (Celeborn) | ⚡ High | 💰 Medium | Resource disaggregation |
Benefits: Performance & Cost
- NVMe: Fastest local SSD storage, highest IOPS, zero network latency
- Direct Access: No network overhead between compute and storage
- Auto-Provisioning: Karpenter automatically detects and configures NVMe storage (see the NodeClass sketch below)
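Under the hood, the auto-provisioning relies on the instanceStorePolicy: RAID0 setting of Karpenter's EC2NodeClass. The sketch below is illustrative rather than this stack's actual NodeClass: the name, AMI alias, IAM role, and discovery tags are assumptions you would replace with your own values.

```bash
# Hypothetical EC2NodeClass sketch. instanceStorePolicy: RAID0 tells Karpenter
# to assemble all local NVMe devices into one RAID0 array and expose it to pods.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: nvme-raid0-example            # illustrative name
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  instanceStorePolicy: RAID0          # format + mount local NVMe as RAID0
  role: KarpenterNodeRole-example     # illustrative IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # illustrative discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
EOF
```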
Example Code
The complete NVMe storage configuration used by this example:
```yaml
# Pre-requisites before running this job:
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with the S3 bucket created by this example (check Terraform outputs)
# 3/ Execute taxi-trip-execute.sh

# This example demonstrates NVMe instance store storage:
# - Direct access to NVMe SSDs attached to instances
# - Maximum performance with local storage - no network I/O overhead
# - Requires latest-generation instances with NVMe storage (c6id, c7id, r6id, r7id, m6id, m7id, i4i)
# - Karpenter's RAID0 policy automatically formats and mounts the NVMe storage
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-nvme"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-nvme"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver UI, ensure the Spark
  # Operator Helm chart is deployed with Ingress enabled, then uncomment:
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # Path to a bundled JAR, Python, or R file
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-nvme"
    "spark.kubernetes.driver.pod.name": "taxi-trip-nvme"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-nvme"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"
    # NVMe storage performance optimizations
    "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
    "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
    "spark.io.compression.codec": "lz4" # Fast compression for NVMe
    "spark.shuffle.compress": "true"
    "spark.shuffle.spill.compress": "true"
    "spark.rdd.compress": "true"
    # Adaptive query execution tuned for local storage
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"
    # Optimize for high-performance local storage
    "spark.sql.files.maxPartitionBytes": "268435456" # 256MB for NVMe throughput
    "spark.sql.shuffle.partitions": "400" # Optimize for parallelism
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": "100"
    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
    # NVMe instance store configuration:
    # use the NVMe SSDs mounted by the Karpenter RAID0 policy at /mnt/k8s-disks.
    # This provides maximum performance with local SSD access and no network overhead.
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 2 # Increased for NVMe workload coordination
    coreLimit: "2000m"
    memory: "8g" # More memory for large-dataset coordination
    memoryOverhead: "2g" # ~25% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.sh/capacity-type: "on-demand"
      karpenter.k8s.aws/instance-family: "c6id"
    # Driver resource requirements: 2 vCPU, 10GB RAM (fits on c6id.2xlarge or larger)
  executor:
    cores: 4 # Utilize NVMe instance capacity
    coreLimit: "4000m"
    instances: 2
    memory: "15g" # Scaled up for a high-performance workload
    memoryOverhead: "3g" # ~20% overhead for high throughput
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.k8s.aws/instance-family: "c6id"
    # Executor resource requirements: 4 vCPU, 18GB RAM per executor (fits on c6id.4xlarge or larger)
```
NVMe Storage Configuration
Key configuration for direct NVMe SSD access:
```yaml
sparkConf:
  # Direct NVMe SSD access - Driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
  # Direct NVMe SSD access - Executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
```

```yaml
# Node selection - uses the existing NodePools
nodeSelector:
  node.kubernetes.io/workload-type: "compute-optimized-x86" # c6id, c7id instances
  # Alternative: "memory-optimized-x86" for r6id, r7id instances
```
Features:
- hostPath: Uses the Karpenter-mounted NVMe storage at /mnt/k8s-disks
- Auto-RAID0: Karpenter automatically configures RAID0 across multiple NVMe drives (see the check below)
- Latest Generation: c6id, c7id, r6id, r7id, m6id, m7id, i4i families with high-performance NVMe
- Zero Network I/O: Direct access to local SSDs
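To confirm the RAID0 array on a provisioned node, you can read /proc/mdstat through a node debug container; kubectl debug mounts the node's root filesystem at /host.

```bash
# Pick an NVMe-backed node and inspect its software-RAID state
NODE=$(kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-x86 \
  -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE -it --image=busybox -- cat /host/proc/mdstat
# Expect an active raid0 md device built from the local nvme disks
```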
Create Test Data and Run Example
Process NYC taxi data to demonstrate NVMe storage performance with direct SSD access.
1. Verify Existing x86 NodePools
```bash
# Check the existing x86 NodePools (NodePools are cluster-scoped,
# and these already include NVMe instance families)
kubectl get nodepools compute-optimized-x86 memory-optimized-x86

# These NodePools already include:
# - compute-optimized-x86: c6id, c7id (compute + NVMe)
# - memory-optimized-x86: r6id, r7id (memory + NVMe)
```
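To double-check which instance families a NodePool actually permits, you can print its requirements; this assumes Karpenter v1-style NodePool resources and jq being available.

```bash
# Show the NodePool's scheduling requirements, including the allowed
# karpenter.k8s.aws/instance-family values
kubectl get nodepool compute-optimized-x86 \
  -o jsonpath='{.spec.template.spec.requirements}' | jq .
```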
2. Prepare Test Data
```bash
cd data-stacks/spark-on-eks/terraform/_local/

# Export the S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to the scripts directory and create the test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION
```
The script downloads NYC taxi data (1.1 GB total) and uploads it to S3.
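Before submitting the job, it is worth confirming the input data landed in S3; the path below matches the script's upload location.

```bash
# List the uploaded input files with a size summary
aws s3 ls s3://$S3_BUCKET/taxi-trip/input/ --human-readable --summarize
```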
3. Execute Spark Job
```bash
# Navigate to the examples directory
cd ../examples/

# Submit the NVMe storage job
envsubst < nvme-storage.yaml | kubectl apply -f -

# Monitor node provisioning (should show x86 instances: c6id/c7id with NVMe)
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-x86 --watch

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch
```
Expected output:
```
NAME             STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-nvme   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
```
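If the application does not reach COMPLETED, the operator's events usually explain why; describing the application and tailing the driver log are reasonable first steps.

```bash
# Show operator events and status details for the application
kubectl describe sparkapplication taxi-trip-nvme -n spark-team-a

# Tail the driver log
kubectl logs -n spark-team-a -l spark-role=driver --tail=50
```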
Verify NVMe Storage Performance
Monitor NVMe Storage Usage
```bash
# Check which nodes are running Spark pods
kubectl get pods -n spark-team-a -o wide

# Check the NVMe mount and usage on an executor
# (executor pods use the taxi-trip-nvme prefix set in sparkConf)
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- df -h

# Expected output shows the RAID0 device mounted at /data1:
# /dev/md0        150G  1.2G  141G   1% /data1

# Verify the block-device layout backing /data1
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- lsblk

# Expected output shows a RAID0 array of NVMe devices:
# md0         9:0    0  149G  0 raid0 /data1
# ├─nvme1n1   259:1  0   75G  0 disk
# └─nvme2n1   259:2  0   75G  0 disk
```
Check Spark Performance with NVMe
```bash
# Check Spark shuffle data on the NVMe storage
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- ls -la /data1/

# Expected output shows the executor's block-manager directories:
# drwxr-xr-x. 22 spark spark 16384 Sep 28 22:09 blockmgr-7c0ac908-26a3-4395-8a8f-2221b4d5d7c3
# drwxr-xr-x. 13 spark spark   116 Sep 28 22:09 blockmgr-9ed9c2fd-53e1-4337-8a8f-2221b4d5d7c3

# Monitor I/O while the job runs (requires iostat in the image; expect very high IOPS)
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- iostat -x 1 3

# View the Spark driver logs for performance metrics
kubectl logs -n spark-team-a -l spark-role=driver --follow
```
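Because the manifest enables Spark's Prometheus servlet sink, you can also scrape the driver's metrics endpoint directly; the metrics paths below follow this example's sparkConf, and the driver pod is looked up by label in case the operator renames it.

```bash
# Forward the driver UI port locally, then scrape the Prometheus endpoint
DRIVER=$(kubectl get pods -n spark-team-a -l spark-role=driver \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n spark-team-a pod/$DRIVER 4040:4040 &
curl -s http://localhost:4040/metrics/driver/prometheus/ | head -n 20
kill %1   # stop the port-forward
```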
Verify Output Data
```bash
# Check the processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/

# Verify the event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/
```
Performance Comparison
Expected I/O Performance
| Storage Type | IOPS | Latency | Bandwidth |
|---|---|---|---|
| NVMe instance store (RAID0) | 500,000+ | <100μs | 4+ GB/s |
| EBS gp3 | 16,000 | 1-3ms | 1 GB/s |
| EBS gp2 | 10,000 | 3-5ms | 250 MB/s |
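To sanity-check these figures on your own nodes, a short fio random-read test against /data1 works. This is a sketch under the assumption that fio is available in the container (the Spark images do not ship it); actual numbers scale with instance size.

```bash
# 4K random reads at high queue depth approximate the IOPS ceiling of the
# RAID0 NVMe array mounted at /data1. Run inside a pod that mounts it.
fio --name=nvme-randread --filename=/data1/fio.test \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
    --direct=1 --size=2G --runtime=30 --time_based --group_reporting
rm -f /data1/fio.test   # clean up the test file
```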
Latest Generation Instance Families and NVMe Capacity
| Instance Family | NVMe Storage | Memory Range | vCPU Range | Use Case |
|---|---|---|---|---|
| c6id | 118GB - 7.6TB | 8GB - 256GB | 2 - 64 | Latest compute-optimized |
| c7id | 118GB - 14TB | 8GB - 384GB | 2 - 96 | Recommended - newest compute |
| r6id | 118GB - 7.6TB | 16GB - 1TB | 2 - 64 | Latest memory-optimized |
| r7id | 118GB - 14TB | 16GB - 1.5TB | 2 - 96 | Recommended - newest memory |
| m6id | 118GB - 7.6TB | 8GB - 512GB | 2 - 64 | Latest general-purpose |
| m7id | 118GB - 14TB | 8GB - 768GB | 2 - 96 | Recommended - newest general |
| i4i | 468GB - 30TB | 12GB - 768GB | 2 - 128 | Maximum NVMe storage |
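The per-size NVMe capacity behind these ranges can be pulled from the EC2 API, for example for the c6id family:

```bash
# List local NVMe capacity (GB) for every c6id size
aws ec2 describe-instance-types \
  --filters "Name=instance-type,Values=c6id.*" \
  --query "InstanceTypes[].[InstanceType, InstanceStorageInfo.TotalSizeInGB]" \
  --output table
```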
NVMe Storage Considerations
Advantages
- Maximum Performance: 500K+ IOPS vs 16K (EBS gp3)
- Zero Network Latency: Direct local storage access
- Cost Included: Storage cost included in instance price
- Auto-Configuration: Karpenter handles RAID0 setup
Disadvantages & Mitigation
- Ephemeral Storage: Data is lost on instance termination
  - Mitigation: Use S3 for persistent data; use NVMe for shuffle only
- Fixed Size: Storage cannot be resized after launch
  - Mitigation: Choose an instance type sized for the workload
- Higher Instance Cost: NVMe instances cost 10-20% more
  - Mitigation: The performance gains often justify the cost
When to Use NVMe Storage
✅ Good for:
- Performance-critical Spark workloads
- Large shuffle operations
- Real-time analytics
- Machine learning training
❌ Avoid for:
- Small datasets that fit in memory
- Cost-sensitive development workloads
- Jobs requiring persistent storage
Cleanup
```bash
# Delete the Spark application
kubectl delete sparkapplication taxi-trip-nvme -n spark-team-a

# NVMe storage is cleaned up automatically when the nodes terminate.
# Note: the x86 NodePools are shared and remain for other workloads.
```
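Once the pods are gone, Karpenter should consolidate the now-empty NVMe nodes; watching NodeClaims confirms the scale-down (assumes Karpenter v1 resource names).

```bash
# Watch Karpenter deprovision the empty NVMe-backed nodes
kubectl get nodeclaims --watch
```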
Next Steps
- EBS Dynamic PVC Storage - Production fault tolerance
- EBS Node Storage - Cost-effective alternative
- Infrastructure Setup - Deploy base infrastructure