Spark with Graviton NVMe Instance Storage

Achieve maximum performance with Apache Spark by pairing ARM64 Graviton processors with direct NVMe SSD storage for ultra-low-latency shuffle operations.

🚀 Recommended: Graviton instances provide superior price-performance, with up to 40% better cost efficiency than comparable x86 instances.

Prerequisites

  • Deploy Spark on EKS infrastructure: Infrastructure Setup
  • Latest generation Graviton instances with NVMe storage (c6gd, c7gd, r6gd, r7gd, m6gd, m7gd, i4g, im4gn families)
  • Karpenter RAID0 policy automatically formats and mounts available NVMe storage
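
To verify the RAID0 policy did its job on a provisioned node, here is a quick sketch, assuming a standard EKS AMI where lsblk is available on the host; the node name is illustrative:

# List Graviton nodes provisioned for this workload type
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-graviton

# Inspect the host's block devices and the RAID0 mount from a node debug pod
kubectl debug node/<graviton-node-name> -it --image=busybox:1.36 -- \
  sh -c 'chroot /host lsblk && chroot /host df -h /mnt/k8s-disks'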
Graviton Performance Advantage

AWS Graviton4 processors deliver up to 30% better compute performance and 75% more memory bandwidth than the previous generation (the instance families in this guide use Graviton2 and Graviton3). Combined with NVMe storage, this provides the highest-performance option for Spark workloads.

Architecture: Graviton ARM64 + Direct NVMe SSD Access

Key Benefits:

  • 🔥 Maximum Performance: Graviton compute + NVMe SSD combination
  • 💰 Best Price-Performance: Up to 40% cost savings vs x86
  • ⚡ Low-Latency Shuffle: Spill and scratch data stay on direct-attached local storage, with no network round trips
  • 🌱 Sustainable: ARM64 architecture with better energy efficiency

Graviton Instance Families and NVMe Capacity

| Instance Family | NVMe Storage | Memory Range | vCPU Range | Use Case |
|---|---|---|---|---|
| c6gd | 118 GB - 3.8 TB | 8 GB - 128 GB | 2 - 32 | Graviton2 compute-optimized |
| c7gd | 118 GB - 7.6 TB | 8 GB - 192 GB | 2 - 48 | Recommended - Graviton3, latest generation |
| r6gd | 118 GB - 3.8 TB | 16 GB - 512 GB | 2 - 32 | Graviton2 memory-optimized |
| r7gd | 118 GB - 7.6 TB | 16 GB - 768 GB | 2 - 48 | Recommended - Graviton3, latest generation |
| m6gd | 118 GB - 3.8 TB | 8 GB - 256 GB | 2 - 32 | Graviton2 general-purpose |
| m7gd | 118 GB - 7.6 TB | 8 GB - 384 GB | 2 - 48 | Recommended - Graviton3, latest generation |
| i4g | 468 GB - 30 TB | 12 GB - 384 GB | 2 - 48 | Graviton2 storage-optimized, maximum NVMe capacity |
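
The capacities above can drift between regions and generations; one way to double-check them yourself, assuming the AWS CLI v2 with credentials configured:

aws ec2 describe-instance-types \
  --filters "Name=processor-info.supported-architecture,Values=arm64" \
            "Name=instance-storage-supported,Values=true" \
  --query "InstanceTypes[].[InstanceType, InstanceStorageInfo.TotalSizeInGB, MemoryInfo.SizeInMiB, VCpuInfo.DefaultVCpus]" \
  --output table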

Performance Benchmarks

For detailed Graviton performance benchmarks and comparisons: 📊 Graviton Spark Benchmarks

Example Code

View the complete configuration:

📄 Complete Graviton NVMe Storage Configuration
examples/nvme-storage-graviton.yaml
# Prerequisites before running this job:
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with the S3 bucket created by this example (check Terraform outputs)
# 3/ Execute taxi-trip-execute.sh

# This example demonstrates Graviton ARM64 with NVMe Instance Store Storage features
# Direct access to NVMe SSDs attached to ARM64 Graviton instances
# Maximum performance with local storage - no network I/O overhead
# Uses latest generation Graviton instances with NVMe storage (c6gd, c7gd, r6gd, r7gd, m6gd, m7gd, i4g, im4gn)
# Karpenter RAID0 policy automatically formats and mounts NVMe storage

---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-graviton"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-graviton"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver,
  # ensure the Spark Operator Helm chart is deployed with Ingress enabled.
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py"  # Path to a bundled JAR, Python, or R file of the application
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-graviton"
    "spark.kubernetes.driver.pod.name": "taxi-trip-graviton"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-graviton"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"

    # NVMe storage performance optimizations for Graviton ARM64
    "spark.shuffle.spill.diskWriteBufferSize": "1048576"  # 1MB buffer for NVMe
    "spark.shuffle.file.buffer": "1m"                     # Larger buffer for local SSD
    "spark.io.compression.codec": "lz4"                   # Fast compression for NVMe
    "spark.shuffle.compress": "true"
    "spark.shuffle.spill.compress": "true"
    "spark.rdd.compress": "true"

    # Adaptive query execution tuned for local NVMe shuffle
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"

    # Optimize for high-throughput local storage on ARM64
    "spark.sql.files.maxPartitionBytes": "268435456"  # 256MB partitions for NVMe throughput
    "spark.sql.shuffle.partitions": "400"             # Tune for parallelism
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"  # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"

    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": "100"

    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"

    # Graviton ARM64 NVMe instance store configuration
    # Karpenter's RAID0 policy mounts the NVMe SSDs at /mnt/k8s-disks; mounting that
    # hostPath at /data1 gives Spark local-SSD scratch space with no network overhead.
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"

  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20

  driver:
    initContainers:
      - name: volume-permissions
        image: busybox:1.36
        command: ['sh', '-c', 'chown -R 185 /data1']  # Grant the spark user (UID 185) ownership of the NVMe scratch dir
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 2              # Extra core for NVMe workload coordination on Graviton
    coreLimit: "2000m"
    memory: "8g"          # More memory for large-dataset coordination
    memoryOverhead: "2g"  # 25% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-graviton"
      karpenter.sh/capacity-type: "on-demand"
      karpenter.k8s.aws/instance-family: "c6gd"
    # Driver resource requirements: 2 vCPU, 10GB RAM (c6gd.xlarge or larger)
  executor:
    initContainers:
      - name: volume-permissions
        image: busybox:1.36
        command: ['sh', '-c', 'chown -R 185 /data1']
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 4              # Make better use of Graviton NVMe instance capacity
    coreLimit: "4000m"
    instances: 2
    memory: "15g"         # Scaled up for high-throughput workloads on ARM64
    memoryOverhead: "3g"  # 20% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-graviton"
      karpenter.k8s.aws/instance-family: "c6gd"
    # Executor resource requirements: 4 vCPU, 18GB RAM per executor (c6gd.2xlarge or larger)
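
Before submitting, you can render and validate the manifest client-side; a minimal sketch using the same envsubst flow as the deploy steps below (the bucket value is a placeholder):

export S3_BUCKET=<your-bucket>   # placeholder; the deploy step derives this from Terraform outputs
envsubst < nvme-storage-graviton.yaml | kubectl apply --dry-run=client -f -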

Graviton NVMe Storage Configuration

Key configuration for ARM64 Graviton with direct NVMe SSD access:

Essential Graviton NVMe Settings
sparkConf:
  # ARM64 NVMe performance optimizations
  "spark.shuffle.spill.diskWriteBufferSize": "1048576"  # 1MB buffer for NVMe
  "spark.shuffle.file.buffer": "1m"                     # Larger buffer for local SSD
  "spark.io.compression.codec": "lz4"                   # Fast compression, well-suited to ARM64

  # Direct NVMe SSD access - driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"

  # Direct NVMe SSD access - executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"

# Graviton NVMe node selection - targets NVMe-backed instances
driver:
  nodeSelector:
    node.kubernetes.io/workload-type: "compute-optimized-graviton"
    karpenter.k8s.aws/instance-family: "c6gd"  # Ensures NVMe-backed instances
  initContainers:
    - name: volume-permissions
      image: busybox:1.36  # Multi-arch image with ARM64 support
      command: ['sh', '-c', 'chown -R 185 /data1']

executor:
  nodeSelector:
    node.kubernetes.io/workload-type: "compute-optimized-graviton"
    karpenter.k8s.aws/instance-family: "c6gd"  # Ensures NVMe-backed instances
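
Once the job is running, it is worth confirming that executors really write to the NVMe-backed mount. The pod name below is illustrative, following Spark's <podNamePrefix>-exec-<id> convention from the configuration above:

# List executor pods for the job
kubectl get pods -n spark-team-a -l spark-role=executor

# /data1 should be backed by the host's /mnt/k8s-disks RAID0 mount
kubectl exec -n spark-team-a taxi-trip-graviton-exec-1 -- df -h /data1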

Features:

  • ARM64 Architecture: Native Graviton optimization
  • Auto-RAID0: Karpenter automatically formats and stripes multiple NVMe drives into a single RAID0 array
  • Latest Generation: c7gd, r7gd, m7gd families with Graviton3 processors
  • No Network I/O for Shuffle: Spill and scratch data go straight to local SSDs

Deploy and Test

1. Verify Existing Graviton NodePools

# Check the existing Graviton NodePools (NodePools are cluster-scoped and already include NVMe instance families)
kubectl get nodepools compute-optimized-graviton memory-optimized-graviton

# These NodePools already include:
# - compute-optimized-graviton: c6gd, c7gd (compute + NVMe)
# - memory-optimized-graviton: r6gd, r7gd (memory + NVMe)
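
To see exactly which instance families a NodePool can launch, you can dump its requirements; a sketch assuming the Karpenter v1 NodePool schema and jq installed:

kubectl get nodepool compute-optimized-graviton \
  -o jsonpath='{.spec.template.spec.requirements}' | jq .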

2. Execute Spark Job on Graviton

cd data-stacks/spark-on-eks/terraform/_local/

# Export S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to example directory
cd ../../examples/

# Submit the Graviton NVMe Storage job
envsubst < nvme-storage-graviton.yaml | kubectl apply -f -

# Monitor node provisioning (should show Graviton instances: c6gd/c7gd with NVMe)
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-graviton --watch

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch

Expected output:

NAME                 STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-graviton   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
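
If the job stalls or fails, the driver pod name is pinned by spark.kubernetes.driver.pod.name, which makes log streaming straightforward:

# Stream driver logs
kubectl logs -n spark-team-a taxi-trip-graviton -f

# Inspect events and detailed status
kubectl describe sparkapplication taxi-trip-graviton -n spark-team-a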

Performance Comparison

Expected Performance Characteristics

| Metric | Graviton + NVMe | x86 + NVMe | Improvement |
|---|---|---|---|
| Price-Performance | Best | Good | Up to 40% better |
| Compute Performance | High | High | Up to 30% better |
| Memory Bandwidth | Very High | High | Up to 75% more |
| Energy Efficiency | Excellent | Good | Up to 60% less energy |

Why Choose Graviton for Spark

Superior for:

  • Cost-sensitive production workloads
  • Large-scale data processing
  • Memory-intensive analytics
  • Sustainable computing initiatives

Graviton Advantages:

  • Up to 40% better price-performance
  • Higher memory bandwidth for in-memory processing
  • Better energy efficiency
  • Native ARM64 ecosystem support

Cleanup

# Delete the Spark application
kubectl delete sparkapplication taxi-trip-graviton -n spark-team-a

# NVMe storage is automatically cleaned up when nodes terminate
# Note: Graviton NodePools are shared and remain for other workloads
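
Karpenter should reclaim the Graviton nodes once no pods require them; a quick way to confirm:

# The node list should drain to empty a few minutes after the job is deleted
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-graviton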

Next Steps