Spark with NVMe Instance Storage

Achieve maximum performance with Apache Spark using direct NVMe SSD storage attached to instances for ultra-low latency shuffle operations.

This guide shows how to use local NVMe SSDs for Spark shuffle storage, the highest-performance option with no network I/O overhead.

Prerequisites​

  • Deploy Spark on EKS infrastructure: Infrastructure Setup
  • Latest generation instances with NVMe storage (c6id, c7id, r6id, r7id, m6id, m7id, i4i families)
  • Karpenter RAID0 policy automatically formats and mounts available NVMe storage
Maximum Performance

NVMe instance storage provides the highest I/O performance for Spark workloads with direct access to local SSDs attached to the instance. No network overhead between compute and storage.

Architecture: Direct NVMe SSD Access​

Key Benefits:

  • πŸ”₯ Maximum Performance: 500K+ IOPS vs 16K IOPS (EBS gp3)
  • ⚑ Zero Network Latency: Direct local storage access
  • πŸ’° Cost Included: Storage cost included in instance price
  • πŸš€ Auto-Configuration: Karpenter RAID0 policy handles setup

What is Shuffle Storage in Spark?​

Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
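
In this example, those intermediate files land wherever spark.local.dir points. A minimal sketch, reusing the same settings as the full configuration further down, shows the two pieces that matter for NVMe shuffle storage:

sparkConf:
  # Shuffle, spill, and block manager files are written under spark.local.dir
  "spark.local.dir": "/data1"              # /data1 is the hostPath mount backed by the NVMe RAID0 array
  "spark.shuffle.compress": "true"         # compress shuffle blocks before they hit disk
  "spark.shuffle.spill.compress": "true"   # compress data spilled during aggregations and sorts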

Spark Shuffle Storage Options​

| Storage Type | Performance | Cost | Use Case |
|---|---|---|---|
| NVMe SSD Instances | 🔥 Very High | 💰 High | Featured - Maximum performance workloads |
| EBS Dynamic PVC | ⚡ High | 💰 Medium | Production fault tolerance |
| EBS Node Storage | 📊 Medium | 💵 Medium | Cost-effective shared storage |
| FSx for Lustre | 📊 Medium | 💵 Low | Parallel filesystem for HPC |
| S3 Express + Mountpoint | 📊 Medium | 💵 Low | Very large datasets |
| Remote Shuffle (Celeborn) | ⚡ High | 💰 Medium | Resource disaggregation |

Benefits: Performance & Cost​

  • NVMe: Fastest local SSD storage, highest IOPS, zero network latency
  • Direct Access: No network overhead between compute and storage
  • Auto-Provisioning: Karpenter automatically detects and configures NVMe

Example Code​

View the complete configuration:

πŸ“„ Complete NVMe Storage Configuration
examples/nvme-storage.yaml
# Pre-requisite before running this job
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with your S3 bucket created by this example (Check Terraform outputs)
# 3/ execute taxi-trip-execute.sh

# This example demonstrates NVMe Instance Store Storage features
# Direct access to NVMe SSDs attached to instances
# Maximum performance with local storage - no network I/O overhead
# Requires latest generation instances with NVMe storage (c6id, c7id, r6id, r7id, m6id, m7id, i4i)
# Karpenter RAID0 policy automatically formats and mounts NVMe storage

---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-nvme"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-nvme"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver UI,
  # ensure the Spark Operator Helm chart is deployed with Ingress enabled.
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # Path to the application's bundled JAR, Python, or R file
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-nvme"
    "spark.kubernetes.driver.pod.name": "taxi-trip-nvme"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-nvme"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"

    # NVMe storage performance optimizations
    "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
    "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
    "spark.io.compression.codec": "lz4" # Fast compression for NVMe
    "spark.shuffle.compress": "true"
    "spark.shuffle.spill.compress": "true"
    "spark.rdd.compress": "true"

    # Local storage optimizations for NVMe
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"

    # Optimize for high-performance local storage
    "spark.sql.files.maxPartitionBytes": "268435456" # 256MB for NVMe throughput
    "spark.sql.shuffle.partitions": "400" # Optimize for parallelism
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"

    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": 100

    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"

    # NVMe instance store configuration
    # Use direct NVMe SSD storage mounted by the Karpenter RAID0 policy at /mnt/k8s-disks.
    # This provides maximum performance with local SSD access and no network overhead.
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20

  driver:
    cores: 2 # Increased for NVMe workload coordination
    coreLimit: "2000m"
    memory: "8g" # More memory for large dataset coordination
    memoryOverhead: "2g" # ~25% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.sh/capacity-type: "on-demand"
      karpenter.k8s.aws/instance-family: "c6id"
    # Driver resource requirements: 2 vCPU, 10GB RAM (minimum c6id.xlarge+)
  executor:
    cores: 4 # Utilize NVMe instance capacity better
    coreLimit: "4000m"
    instances: 2
    memory: "15g" # Scale up for high-performance workloads
    memoryOverhead: "3g" # ~20% overhead for high throughput
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.k8s.aws/instance-family: "c6id"
    # Executor resource requirements: 4 vCPU, 18GB RAM per executor (minimum c6id.xlarge+)

NVMe Storage Configuration​

Key configuration for direct NVMe SSD access:

Essential NVMe Storage Settings
sparkConf:
  # Direct NVMe SSD access - Driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

  # Direct NVMe SSD access - Executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

# Node selection - uses existing NodePools
nodeSelector:
  node.kubernetes.io/workload-type: "compute-optimized-x86" # For c6id, c7id instances
  # Alternative: "memory-optimized-x86" for r6id, r7id instances

Features:

  • hostPath: Uses Karpenter-mounted NVMe storage at /mnt/k8s-disks
  • Auto-RAID0: Karpenter automatically configures RAID0 across multiple NVMe drives (see the sketch after this list)
  • Latest Generation: c6id, c7id, r6id, r7id, m6id, m7id, i4i families with high-performance NVMe
  • Zero Network I/O: Direct access to local SSDs
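
The Auto-RAID0 behavior above is driven by Karpenter's EC2NodeClass instanceStorePolicy field. Below is a minimal sketch of such a node class; the resource name, AMI alias, IAM role, and discovery tags are placeholders rather than values from this repository, and the Infrastructure Setup from the prerequisites normally provisions the real one for you.

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: nvme-raid0                               # placeholder name
spec:
  amiSelectorTerms:
    - alias: al2023@latest                       # assumed AMI selection
  role: "KarpenterNodeRole-spark-on-eks"         # assumed node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "spark-on-eks"   # assumed discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "spark-on-eks"
  instanceStorePolicy: RAID0                     # RAID0 all local NVMe instance-store volumes and expose them to pods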

Create Test Data and Run Example​

Process NYC taxi data to demonstrate NVMe storage performance with direct SSD access.

1. Verify Existing x86 NodePools​

# Check existing x86 NodePools (already include NVMe instances)
kubectl get nodepools compute-optimized-x86 memory-optimized-x86

# These NodePools already include:
# - compute-optimized-x86: c6id, c7id (compute + NVMe)
# - memory-optimized-x86: r6id, r7id (memory + NVMe)

2. Prepare Test Data​

cd data-stacks/spark-on-eks/terraform/_local/

# Export S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to scripts directory and create test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION

Downloads NYC taxi data (1.1GB total) and uploads to S3

3. Execute Spark Job​

# Navigate to example directory
cd ../examples/

# Submit the NVMe Storage job
envsubst < nvme-storage.yaml | kubectl apply -f -

# Monitor node provisioning (should show x86 instances: c6id/c7id with NVMe)
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-x86 --watch

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch

Expected output:

NAME             STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-nvme   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s

Verify NVMe Storage Performance​

Monitor NVMe Storage Usage​

# Check which nodes have Spark pods
kubectl get pods -n spark-team-a -o wide

# Check NVMe storage mounting and performance
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- df -h

# Expected output shows NVMe mounted at /data1:
# /dev/md0 150G 1.2G 141G 1% /data1

# Verify NVMe storage performance characteristics
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- lsblk

# Expected output shows RAID0 of NVMe devices:
# md0 9:0 0 149G 0 raid0 /data1
# ├─nvme1n1 259:1 0 75G 0 disk
# └─nvme2n1 259:2 0 75G 0 disk

Check Spark Performance with NVMe​

# Check Spark shuffle data on NVMe storage
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- ls -la /data1/

# Expected output shows Spark shuffle block manager directories on NVMe:
# drwxr-xr-x. 22 spark spark 16384 Sep 28 22:09 blockmgr-7c0ac908-26a3-4395-8a8f-2221b4d5d7c3
# drwxr-xr-x. 13 spark spark 116 Sep 28 22:09 blockmgr-9ed9c2fd-53e1-4337-8a8f-2221b4d5d7c3

# Monitor I/O performance (should show very high IOPS)
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- iostat -x 1 3

# View Spark application logs for performance metrics
kubectl logs -n spark-team-a -l spark-role=driver --follow

Verify Output Data​

# Check processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/

# Verify event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/

Performance Comparison​

Expected I/O Performance​

| Storage Type | IOPS | Latency | Bandwidth |
|---|---|---|---|
| NVMe instance store (RAID0) | 500,000+ | < 100 μs | 4+ GB/s |
| EBS gp3 | 16,000 | 1-3 ms | 1 GB/s |
| EBS gp2 | 10,000 | 3-5 ms | 250 MB/s |

Latest Generation Instance Families and NVMe Capacity​

| Instance Family | NVMe Storage | Memory Range | vCPU Range | Use Case |
|---|---|---|---|---|
| c6id | 118GB - 7.6TB | 8GB - 256GB | 2 - 64 | Latest compute-optimized |
| c7id | 118GB - 14TB | 8GB - 384GB | 2 - 96 | Recommended - newest compute |
| r6id | 118GB - 7.6TB | 16GB - 1TB | 2 - 64 | Latest memory-optimized |
| r7id | 118GB - 14TB | 16GB - 1.5TB | 2 - 96 | Recommended - newest memory |
| m6id | 118GB - 7.6TB | 8GB - 512GB | 2 - 64 | Latest general-purpose |
| m7id | 118GB - 14TB | 8GB - 768GB | 2 - 96 | Recommended - newest general |
| i4i | 468GB - 30TB | 12GB - 768GB | 2 - 128 | Maximum NVMe storage |
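
To land on a family other than c6id from the table above (for example the r6id/r7id rows for memory-heavy jobs), swap the node labels used in the example to the memory-optimized NodePool that the infrastructure already provides. A hypothetical executor fragment:

executor:
  nodeSelector:
    node.kubernetes.io/workload-type: "memory-optimized-x86"   # existing NodePool with r6id/r7id NVMe instances
    karpenter.k8s.aws/instance-family: "r6id"                  # optionally pin to one family from the table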

NVMe Storage Considerations​

Advantages​

  • Maximum Performance: 500K+ IOPS vs 16K (EBS gp3)
  • Zero Network Latency: Direct local storage access
  • Cost Included: Storage cost included in instance price
  • Auto-Configuration: Karpenter handles RAID0 setup

Disadvantages & Mitigation​

  • Ephemeral Storage: Data lost on instance termination
    • Mitigation: Use S3 for persistent data, NVMe for shuffle only (see the fragment after this list)
  • Fixed Size: Cannot resize storage after launch
    • Mitigation: Choose appropriate instance type for workload
  • Higher Instance Cost: NVMe instances cost 10-20% more
    • Mitigation: Performance gains often justify cost
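
The first mitigation is already how the example above is wired: only shuffle and spill data touch the ephemeral NVMe mount, while inputs, outputs, and event logs live on S3. The relevant lines from the configuration, pulled into one fragment:

sparkConf:
  # Ephemeral: shuffle and spill data on local NVMe (lost when the node terminates)
  "spark.local.dir": "/data1"
  # Persistent: event logs on S3
  "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
arguments:
  - "s3a://$S3_BUCKET/taxi-trip/input/"    # persistent input data
  - "s3a://$S3_BUCKET/taxi-trip/output/"   # persistent output data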

When to Use NVMe Storage​

βœ… Good for:

  • Performance-critical Spark workloads
  • Large shuffle operations
  • Real-time analytics
  • Machine learning training

❌ Avoid for:

  • Small datasets that fit in memory
  • Cost-sensitive development workloads
  • Jobs requiring persistent storage

Cleanup​

# Delete the Spark application
kubectl delete sparkapplication taxi-trip-nvme -n spark-team-a

# NVMe storage is automatically cleaned up when nodes terminate
# Note: x86 NodePools are shared and remain for other workloads

Next Steps​