Spark with NVMe Instance Storage
Achieve maximum Apache Spark performance by using the NVMe SSDs attached directly to the instances for shuffle storage. Direct local access eliminates network I/O between compute and storage, making this the highest-performance shuffle option.
Prerequisites
- Deploy Spark on EKS infrastructure: Infrastructure Setup
- Latest generation instances with NVMe storage (c6id, c7id, r6id, r7id, m6id, m7id, i4i families)
- Karpenter RAID0 policy automatically formats and mounts available NVMe storage
NVMe instance storage provides the highest I/O performance for Spark workloads: executors read and write directly to local SSDs attached to the instance, with no network hop between compute and storage.
Architecture: Direct NVMe SSD Access
Key Benefits:
- 🔥 Maximum Performance: 500K+ IOPS vs 16K IOPS (EBS gp3)
- ⚡ Zero Network Latency: Direct local storage access
- 💰 Cost Included: Storage cost is part of the instance price
- 🚀 Auto-Configuration: Karpenter RAID0 policy handles setup
What is Shuffle Storage in Spark?
Shuffle storage holds intermediate data during Spark operations like groupBy, join, and reduceByKey. When data is redistributed across executors, it's temporarily stored before being read by subsequent stages.
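Concretely, shuffle files land under spark.local.dir, which this example points at the NVMe mount /data1. While a stage with a wide dependency (for example a groupBy) is running, you can watch those files accumulate on an executor; the pod name below follows this example's naming and is only illustrative.

```bash
# Watch shuffle data accumulate under spark.local.dir (/data1 here).
# blockmgr-* directories hold the shuffle blocks written by each executor.
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- \
  sh -c 'du -sh /data1/blockmgr-* 2>/dev/null; find /data1 -name "shuffle_*" | head'
```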
Spark Shuffle Storage Options
| Storage Type | Performance | Cost | Use Case |
|---|---|---|---|
| NVMe SSD Instances | 🔥 Very High | 💰 High | Featured - maximum-performance workloads |
| EBS Dynamic PVC | ⚡ High | 💰 Medium | Production fault tolerance |
| EBS Node Storage | 📊 Medium | 💵 Medium | Cost-effective shared storage |
| FSx for Lustre | 📊 Medium | 💵 Low | Parallel filesystem for HPC |
| S3 Express + Mountpoint | 📊 Medium | 💵 Low | Very large datasets |
| Remote Shuffle (Celeborn) | ⚡ High | 💰 Medium | Resource disaggregation |
Benefits: Performance & Cost
- NVMe: Fastest local SSD storage, highest IOPS, zero network latency
- Direct Access: No network overhead between compute and storage
- Auto-Provisioning: Karpenter automatically detects and configures NVMe storage (see the NodeClass sketch below)
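Under the hood, the auto-provisioning relies on the instanceStorePolicy: RAID0 setting of Karpenter's EC2NodeClass. The sketch below is illustrative rather than this stack's actual NodeClass: the name, AMI alias, IAM role, and discovery tags are assumptions you would replace with your own values.

```bash
# Hypothetical EC2NodeClass sketch. instanceStorePolicy: RAID0 tells Karpenter
# to assemble all local NVMe devices into one RAID0 array and expose it to pods.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: nvme-raid0-example            # illustrative name
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  instanceStorePolicy: RAID0          # format + mount local NVMe as RAID0
  role: KarpenterNodeRole-example     # illustrative IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # illustrative discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
EOF
```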
Example Code
The complete NVMe storage configuration used by this example:
```yaml
# Pre-requisites before running this job:
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with the S3 bucket created by this example (check Terraform outputs)
# 3/ Execute taxi-trip-execute.sh

# This example demonstrates NVMe instance store storage:
# - Direct access to NVMe SSDs attached to instances
# - Maximum performance with local storage - no network I/O overhead
# - Requires latest-generation instances with NVMe storage (c6id, c7id, r6id, r7id, m6id, m7id, i4i)
# - Karpenter's RAID0 policy automatically formats and mounts the NVMe storage
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-nvme"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-nvme"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver UI, ensure the Spark
  # Operator Helm chart is deployed with Ingress enabled, then uncomment:
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # Path to a bundled JAR, Python, or R file
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-nvme"
    "spark.kubernetes.driver.pod.name": "taxi-trip-nvme"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-nvme"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"
    # NVMe storage performance optimizations
    "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
    "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
    "spark.io.compression.codec": "lz4" # Fast compression for NVMe
    "spark.shuffle.compress": "true"
    "spark.shuffle.spill.compress": "true"
    "spark.rdd.compress": "true"
    # Adaptive query execution tuned for local storage
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"
    # Optimize for high-performance local storage
    "spark.sql.files.maxPartitionBytes": "268435456" # 256MB for NVMe throughput
    "spark.sql.shuffle.partitions": "400" # Optimize for parallelism
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": "100"
    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
    # NVMe instance store configuration:
    # use the NVMe SSDs mounted by the Karpenter RAID0 policy at /mnt/k8s-disks.
    # This provides maximum performance with local SSD access and no network overhead.
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 2 # Increased for NVMe workload coordination
    coreLimit: "2000m"
    memory: "8g" # More memory for large-dataset coordination
    memoryOverhead: "2g" # ~25% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.sh/capacity-type: "on-demand"
      karpenter.k8s.aws/instance-family: "c6id"
    # Driver resource requirements: 2 vCPU, 10GB RAM (fits on c6id.2xlarge or larger)
  executor:
    cores: 4 # Utilize NVMe instance capacity
    coreLimit: "4000m"
    instances: 2
    memory: "15g" # Scaled up for a high-performance workload
    memoryOverhead: "3g" # ~20% overhead for high throughput
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-x86"
      karpenter.k8s.aws/instance-family: "c6id"
    # Executor resource requirements: 4 vCPU, 18GB RAM per executor (fits on c6id.4xlarge or larger)
```
NVMe Storage Configuration
Key configuration for direct NVMe SSD access:
```yaml
sparkConf:
  # Direct NVMe SSD access - Driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
  # Direct NVMe SSD access - Executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
```

```yaml
# Node selection - uses the existing NodePools
nodeSelector:
  node.kubernetes.io/workload-type: "compute-optimized-x86" # c6id, c7id instances
  # Alternative: "memory-optimized-x86" for r6id, r7id instances
```
Features:
- hostPath: Uses the Karpenter-mounted NVMe storage at /mnt/k8s-disks
- Auto-RAID0: Karpenter automatically configures RAID0 across multiple NVMe drives (see the check below)
- Latest Generation: c6id, c7id, r6id, r7id, m6id, m7id, i4i families with high-performance NVMe
- Zero Network I/O: Direct access to local SSDs
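To confirm the RAID0 array on a provisioned node, you can read /proc/mdstat through a node debug container; kubectl debug mounts the node's root filesystem at /host.

```bash
# Pick an NVMe-backed node and inspect its software-RAID state
NODE=$(kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-x86 \
  -o jsonpath='{.items[0].metadata.name}')
kubectl debug node/$NODE -it --image=busybox -- cat /host/proc/mdstat
# Expect an active raid0 md device built from the local nvme disks
```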
Create Test Data and Run Example
Process NYC taxi data to demonstrate NVMe storage performance with direct SSD access.
1. Verify Existing x86 NodePools
```bash
# Check the existing x86 NodePools (NodePools are cluster-scoped,
# and these already include NVMe instance families)
kubectl get nodepools compute-optimized-x86 memory-optimized-x86

# These NodePools already include:
# - compute-optimized-x86: c6id, c7id (compute + NVMe)
# - memory-optimized-x86: r6id, r7id (memory + NVMe)
```
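To double-check which instance families a NodePool actually permits, you can print its requirements; this assumes Karpenter v1-style NodePool resources and jq being available.

```bash
# Show the NodePool's scheduling requirements, including the allowed
# karpenter.k8s.aws/instance-family values
kubectl get nodepool compute-optimized-x86 \
  -o jsonpath='{.spec.template.spec.requirements}' | jq .
```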
2. Prepare Test Data
```bash
cd data-stacks/spark-on-eks/terraform/_local/

# Export the S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to the scripts directory and create the test data
cd ../../scripts/
./taxi-trip-execute.sh $S3_BUCKET $REGION
```
The script downloads NYC taxi data (1.1 GB total) and uploads it to S3.
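Before submitting the job, it is worth confirming the input data landed in S3; the path below matches the script's upload location.

```bash
# List the uploaded input files with a size summary
aws s3 ls s3://$S3_BUCKET/taxi-trip/input/ --human-readable --summarize
```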
3. Execute Spark Job
```bash
# Navigate to the examples directory
cd ../examples/

# Submit the NVMe storage job
envsubst < nvme-storage.yaml | kubectl apply -f -

# Monitor node provisioning (should show x86 instances: c6id/c7id with NVMe)
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-x86 --watch

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch
```
Expected output:
```
NAME             STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-nvme   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
```
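If the application does not reach COMPLETED, the operator's events usually explain why; describing the application and tailing the driver log are reasonable first steps.

```bash
# Show operator events and status details for the application
kubectl describe sparkapplication taxi-trip-nvme -n spark-team-a

# Tail the driver log
kubectl logs -n spark-team-a -l spark-role=driver --tail=50
```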
Verify NVMe Storage Performance
Monitor NVMe Storage Usage
```bash
# Check which nodes are running Spark pods
kubectl get pods -n spark-team-a -o wide

# Check the NVMe mount and usage on an executor
# (executor pods use the taxi-trip-nvme prefix set in sparkConf)
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- df -h

# Expected output shows the RAID0 device mounted at /data1:
# /dev/md0        150G  1.2G  141G   1% /data1

# Verify the block-device layout backing /data1
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- lsblk

# Expected output shows a RAID0 array of NVMe devices:
# md0         9:0    0  149G  0 raid0 /data1
# ├─nvme1n1   259:1  0   75G  0 disk
# └─nvme2n1   259:2  0   75G  0 disk
```
Check Spark Performance with NVMe
```bash
# Check Spark shuffle data on the NVMe storage
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- ls -la /data1/

# Expected output shows the executor's block-manager directories:
# drwxr-xr-x. 22 spark spark 16384 Sep 28 22:09 blockmgr-7c0ac908-26a3-4395-8a8f-2221b4d5d7c3
# drwxr-xr-x. 13 spark spark   116 Sep 28 22:09 blockmgr-9ed9c2fd-53e1-4337-8a8f-2221b4d5d7c3

# Monitor I/O while the job runs (requires iostat in the image; expect very high IOPS)
kubectl exec -n spark-team-a taxi-trip-nvme-exec-1 -- iostat -x 1 3

# View the Spark driver logs for performance metrics
kubectl logs -n spark-team-a -l spark-role=driver --follow
```
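Because the manifest enables Spark's Prometheus servlet sink, you can also scrape the driver's metrics endpoint directly; the metrics paths below follow this example's sparkConf, and the driver pod is looked up by label in case the operator renames it.

```bash
# Forward the driver UI port locally, then scrape the Prometheus endpoint
DRIVER=$(kubectl get pods -n spark-team-a -l spark-role=driver \
  -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n spark-team-a pod/$DRIVER 4040:4040 &
curl -s http://localhost:4040/metrics/driver/prometheus/ | head -n 20
kill %1   # stop the port-forward
```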
Verify Output Data
```bash
# Check the processed output in S3
aws s3 ls s3://$S3_BUCKET/taxi-trip/output/

# Verify the event logs
aws s3 ls s3://$S3_BUCKET/spark-event-logs/
```
Performance Comparison
Expected I/O Performance
| Storage Type | IOPS | Latency | Bandwidth |
|---|---|---|---|
| NVMe instance store (RAID0) | 500,000+ | <100μs | 4+ GB/s |
| EBS gp3 | 16,000 | 1-3ms | 1 GB/s |
| EBS gp2 | 10,000 | 3-5ms | 250 MB/s |
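To sanity-check these figures on your own nodes, a short fio random-read test against /data1 works. This is a sketch under the assumption that fio is available in the container (the Spark images do not ship it); actual numbers scale with instance size.

```bash
# 4K random reads at high queue depth approximate the IOPS ceiling of the
# RAID0 NVMe array mounted at /data1. Run inside a pod that mounts it.
fio --name=nvme-randread --filename=/data1/fio.test \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=64 --numjobs=4 \
    --direct=1 --size=2G --runtime=30 --time_based --group_reporting
rm -f /data1/fio.test   # clean up the test file
```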
Latest Generation Instance Families and NVMe Capacity
| Instance Family | NVMe Storage | Memory Range | vCPU Range | Use Case |
|---|---|---|---|---|
| c6id | 118GB - 7.6TB | 8GB - 256GB | 2 - 64 | Latest compute-optimized |
| c7id | 118GB - 14TB | 8GB - 384GB | 2 - 96 | Recommended - newest compute |
| r6id | 118GB - 7.6TB | 16GB - 1TB | 2 - 64 | Latest memory-optimized |
| r7id | 118GB - 14TB | 16GB - 1.5TB | 2 - 96 | Recommended - newest memory |
| m6id | 118GB - 7.6TB | 8GB - 512GB | 2 - 64 | Latest general-purpose |
| m7id | 118GB - 14TB | 8GB - 768GB | 2 - 96 | Recommended - newest general |
| i4i | 468GB - 30TB | 12GB - 768GB | 2 - 128 | Maximum NVMe storage |
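The per-size NVMe capacity behind these ranges can be pulled from the EC2 API, for example for the c6id family:

```bash
# List local NVMe capacity (GB) for every c6id size
aws ec2 describe-instance-types \
  --filters "Name=instance-type,Values=c6id.*" \
  --query "InstanceTypes[].[InstanceType, InstanceStorageInfo.TotalSizeInGB]" \
  --output table
```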
NVMe Storage Considerations
Advantages
- Maximum Performance: 500K+ IOPS vs 16K (EBS gp3)
- Zero Network Latency: Direct local storage access
- Cost Included: Storage cost included in instance price
- Auto-Configuration: Karpenter handles RAID0 setup
Disadvantages & Mitigation
- Ephemeral Storage: Data is lost on instance termination
  - Mitigation: Use S3 for persistent data; use NVMe for shuffle only
- Fixed Size: Storage cannot be resized after launch
  - Mitigation: Choose an instance type sized for the workload
- Higher Instance Cost: NVMe instances cost 10-20% more
  - Mitigation: The performance gains often justify the cost
When to Use NVMe Storage
✅ Good for:
- Performance-critical Spark workloads
- Large shuffle operations
- Real-time analytics
- Machine learning training
❌ Avoid for:
- Small datasets that fit in memory
- Cost-sensitive development workloads
- Jobs requiring persistent storage
Cleanup
```bash
# Delete the Spark application
kubectl delete sparkapplication taxi-trip-nvme -n spark-team-a

# NVMe storage is cleaned up automatically when the nodes terminate.
# Note: the x86 NodePools are shared and remain for other workloads.
```
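Once the pods are gone, Karpenter should consolidate the now-empty NVMe nodes; watching NodeClaims confirms the scale-down (assumes Karpenter v1 resource names).

```bash
# Watch Karpenter deprovision the empty NVMe-backed nodes
kubectl get nodeclaims --watch
```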
Next Steps
- EBS Dynamic PVC Storage - Production fault tolerance
- EBS Node Storage - Cost-effective alternative
- Infrastructure Setup - Deploy base infrastructure