# Spark with Graviton NVMe Instance Storage

Achieve maximum performance with Apache Spark using ARM64 Graviton processors and direct NVMe SSD storage for low-latency shuffle operations.

🚀 Recommended: Graviton instances deliver up to 40% better price-performance than comparable x86 instances.

## Prerequisites
- Deploy Spark on EKS infrastructure: Infrastructure Setup
- Graviton instances with local NVMe storage (c6gd, c7gd, r6gd, r7gd, m6gd, m7gd, i4g, im4gn families)
- Karpenter's RAID0 policy automatically formats and mounts the available NVMe storage (see the `EC2NodeClass` sketch below)
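The RAID0 behavior comes from Karpenter's `instanceStorePolicy` field on the `EC2NodeClass`. A minimal sketch of what enables it (the metadata name, role, and AMI alias here are illustrative assumptions, not the blueprint's exact values; subnet and security-group selector terms are omitted for brevity):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: graviton-nvme # hypothetical name for illustration
spec:
  amiSelectorTerms:
    - alias: al2023@latest # assumption: Amazon Linux 2023 AMI family
  role: KarpenterNodeRole # assumption: your Karpenter node IAM role
  # Format all local NVMe instance-store volumes into a single RAID0
  # array and mount it for kubelet/pod ephemeral storage use.
  instanceStorePolicy: RAID0
```

With `instanceStorePolicy: RAID0`, nodes come up with the array already formatted and mounted, so Spark pods can consume it immediately via `hostPath` volumes.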
AWS Graviton4 processors deliver up to 30% better compute performance and 75% more memory bandwidth than the previous generation (Graviton3). The NVMe-equipped families used in this guide are based on Graviton2 (c6gd, r6gd, m6gd, i4g, im4gn) and Graviton3 (c7gd, r7gd, m7gd); combined with local NVMe storage, they provide the highest-performance option for shuffle-heavy Spark workloads.
## Architecture: Graviton ARM64 + Direct NVMe SSD Access

Key Benefits:
- 🔥 Maximum Performance: Graviton compute paired with local NVMe SSDs
- 💰 Best Price-Performance: up to 40% cost savings vs x86
- ⚡ No Network I/O for Shuffle: direct local storage access
- 🌱 Sustainable: ARM64 architecture with better energy efficiency
## Graviton Instance Families and NVMe Capacity

| Instance Family | NVMe Storage | Memory Range | vCPU Range | Use Case |
|---|---|---|---|---|
| c6gd | 118GB - 3.8TB | 8GB - 128GB | 2 - 32 | Graviton2 compute-optimized |
| c7gd | 118GB - 7.6TB | 8GB - 192GB | 2 - 48 | Recommended - Graviton3 compute-optimized |
| r6gd | 118GB - 3.8TB | 16GB - 512GB | 2 - 32 | Graviton2 memory-optimized |
| r7gd | 118GB - 7.6TB | 16GB - 768GB | 2 - 48 | Recommended - Graviton3 memory-optimized |
| m6gd | 118GB - 3.8TB | 8GB - 256GB | 2 - 32 | Graviton2 general-purpose |
| m7gd | 118GB - 7.6TB | 8GB - 384GB | 2 - 48 | Recommended - Graviton3 general-purpose |
| i4g | 468GB - 30TB | 12GB - 384GB | 2 - 48 | Graviton2 storage-optimized, maximum NVMe |
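These families are targeted through Karpenter NodePool requirements using well-known labels. A hedged sketch of what a `compute-optimized-graviton` NodePool could look like (the blueprint's actual NodePool may differ; `graviton-nvme` refers to the hypothetical `EC2NodeClass` sketched above):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: compute-optimized-graviton
spec:
  template:
    metadata:
      labels:
        node.kubernetes.io/workload-type: compute-optimized-graviton
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: graviton-nvme # hypothetical, see the sketch above
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6gd", "c7gd"] # NVMe-equipped compute families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
```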
## Performance Benchmarks

For detailed Graviton performance benchmarks and comparisons, see 📊 Graviton Spark Benchmarks.
## Example Code
View the complete configuration:
📄 Complete Graviton NVMe Storage Configuration
```yaml
# Pre-requisites before running this job:
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with the S3 bucket created by this example (check Terraform outputs)
# 3/ Execute taxi-trip-execute.sh
#
# This example demonstrates Graviton ARM64 with NVMe instance store storage:
# - Direct access to NVMe SSDs attached to ARM64 Graviton instances
# - Maximum performance with local storage - no network I/O overhead
# - Uses Graviton instances with NVMe storage (c6gd, c7gd, r6gd, r7gd, m6gd, m7gd, i4g, im4gn)
# - Karpenter RAID0 policy automatically formats and mounts the NVMe storage
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-graviton"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-graviton"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver UI, ensure the Spark Operator
  # Helm chart is deployed with Ingress enabled, then uncomment:
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # Path to the bundled JAR, Python, or R file of the application
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-graviton"
    "spark.kubernetes.driver.pod.name": "taxi-trip-graviton"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-graviton"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"
    # NVMe storage performance optimizations for Graviton ARM64
    "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
    "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
    "spark.io.compression.codec": "lz4" # Fast compression for NVMe
    "spark.shuffle.compress": "true"
    "spark.shuffle.spill.compress": "true"
    "spark.rdd.compress": "true"
    # Adaptive query execution pairs well with fast local shuffle storage
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"
    # Optimize for high-throughput local storage on ARM64
    "spark.sql.files.maxPartitionBytes": "268435456" # 256MB for NVMe throughput
    "spark.sql.shuffle.partitions": "400" # Tune for the job's parallelism
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": "100"
    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
    # Graviton ARM64 NVMe instance store configuration:
    # use the NVMe storage mounted by the Karpenter RAID0 policy at /mnt/k8s-disks.
    # This provides maximum performance - local SSD access with no network overhead.
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    initContainers:
      - name: volume-permissions
        image: busybox:1.36
        command: ['sh', '-c', 'chown -R 185 /data1'] # 185 is the spark user in the image
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 2 # Extra headroom for NVMe workload coordination on Graviton
    coreLimit: "2000m"
    memory: "8g" # More memory for large dataset coordination
    memoryOverhead: "2g" # 25% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-graviton"
      karpenter.sh/capacity-type: "on-demand"
      karpenter.k8s.aws/instance-family: "c6gd"
    # Driver footprint: 2 vCPU, 10GB RAM (needs c6gd.2xlarge or larger; c6gd.xlarge has only 8GB)
  executor:
    initContainers:
      - name: volume-permissions
        image: busybox:1.36
        command: ['sh', '-c', 'chown -R 185 /data1']
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 4 # Better utilization of Graviton NVMe instance capacity
    coreLimit: "4000m"
    instances: 2
    memory: "15g" # Scaled up for high-throughput work on ARM64
    memoryOverhead: "3g" # 20% overhead for high-throughput shuffle
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-graviton"
      karpenter.k8s.aws/instance-family: "c6gd"
    # Executor footprint: 4 vCPU, 18GB RAM per executor (needs c6gd.4xlarge or larger)
```
## Graviton NVMe Storage Configuration
Key configuration for ARM64 Graviton with direct NVMe SSD access:
```yaml
sparkConf:
  # ARM64 NVMe performance optimizations
  "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
  "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
  "spark.io.compression.codec": "lz4" # Fast compression, efficient on ARM64
  # Direct NVMe SSD access - driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  # Direct NVMe SSD access - executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"

# Graviton NVMe node selection - targets NVMe-equipped instance families
driver:
  nodeSelector:
    node.kubernetes.io/workload-type: "compute-optimized-graviton"
    karpenter.k8s.aws/instance-family: "c6gd" # Ensures NVMe instances
  initContainers:
    - name: volume-permissions
      image: busybox:1.36 # Multi-arch image with ARM64 support
      command: ['sh', '-c', 'chown -R 185 /data1']
executor:
  nodeSelector:
    node.kubernetes.io/workload-type: "compute-optimized-graviton"
    karpenter.k8s.aws/instance-family: "c6gd" # Ensures NVMe instances
```
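Under the hood, Spark on Kubernetes translates each `spark.kubernetes.{driver,executor}.volumes.hostPath.*` property into a `hostPath` volume plus a container mount on the pods it creates. A rough sketch of the resulting executor pod fragment (illustrative, not literal operator output):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-executor
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /data1 # matches spark.local.dir, so shuffle/spill lands here
  volumes:
    - name: spark-local-dir-1
      hostPath:
        path: /mnt/k8s-disks # NVMe RAID0 array mounted by Karpenter
        type: Directory # kubelet refuses to start the pod if this path is missing
```

The `Directory` type acts as a guardrail: if a pod ever lands on a node without the RAID0 mount, it fails fast instead of silently spilling to the root volume.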
Features:
- ARM64 Architecture: Native Graviton optimization
- Auto-RAID0: Karpenter automatically configures RAID0 for multiple NVMe drives
- Latest Generation: c7gd, r7gd, m7gd families with Graviton3 processors
- Zero Network I/O: Direct access to local SSDs
## Deploy and Test

### 1. Verify Existing Graviton NodePools
```bash
# Check the existing Graviton NodePools (these already include NVMe instance families)
# Note: NodePools are cluster-scoped, so no namespace flag is needed
kubectl get nodepools compute-optimized-graviton memory-optimized-graviton

# These NodePools already include:
# - compute-optimized-graviton: c6gd, c7gd (compute + NVMe)
# - memory-optimized-graviton: r6gd, r7gd (memory + NVMe)
```
### 2. Execute Spark Job on Graviton
```bash
cd data-stacks/spark-on-eks/terraform/_local/

# Export the S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to the examples directory
cd ../../examples/

# Submit the Graviton NVMe storage job
envsubst < nvme-storage-graviton.yaml | kubectl apply -f -

# Monitor node provisioning (should show Graviton instances: c6gd/c7gd with NVMe)
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-graviton --watch

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch
```
Expected output:
```text
NAME                 STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-graviton   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
```
## Performance Comparison

### Expected Performance Characteristics

| Metric | Graviton + NVMe | x86 + NVMe | Improvement (up to) |
|---|---|---|---|
| Price-Performance | Best | Good | 40% better |
| Compute Performance | High | High | 30% better |
| Memory Bandwidth | Very High | High | 75% more |
| Energy Efficiency | Excellent | Good | 60% better |
## Why Choose Graviton for Spark
✅ Superior for:
- Cost-sensitive production workloads
- Large-scale data processing
- Memory-intensive analytics
- Sustainable computing initiatives
✅ Graviton Advantages:
- Up to 40% better price-performance
- Higher memory bandwidth for in-memory processing
- Better energy efficiency
- Native ARM64 ecosystem support
## Cleanup
```bash
# Delete the Spark application
kubectl delete sparkapplication taxi-trip-graviton -n spark-team-a

# NVMe instance storage is wiped automatically when nodes terminate
# Note: the Graviton NodePools are shared and remain in place for other workloads
```
## Next Steps
- 📊 Graviton Performance Benchmarks - Detailed performance analysis
- NVMe Instance Storage (x86) - x86 NVMe comparison
- EBS Dynamic PVC Storage - Production fault tolerance
- Infrastructure Setup - Deploy base infrastructure