# Spark with Graviton NVMe Instance Storage

Achieve maximum performance with Apache Spark using ARM64 Graviton processors and direct NVMe SSD storage for low-latency shuffle operations.

🚀 Recommended: Graviton instances deliver up to 40% better price-performance than comparable x86 instances.

## Prerequisites
- Deploy Spark on EKS infrastructure: Infrastructure Setup
- Graviton instances with local NVMe storage (c6gd, c7gd, r6gd, r7gd, m6gd, m7gd, i4g, im4gn families)
- Karpenter's RAID0 policy automatically formats and mounts the available NVMe storage (see the `EC2NodeClass` sketch below)
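The RAID0 behavior comes from Karpenter's `instanceStorePolicy` field on the `EC2NodeClass`. A minimal sketch of what enables it (the metadata name, role, and AMI alias here are illustrative assumptions, not the blueprint's exact values; subnet and security-group selector terms are omitted for brevity):

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: graviton-nvme # hypothetical name for illustration
spec:
  amiSelectorTerms:
    - alias: al2023@latest # assumption: Amazon Linux 2023 AMI family
  role: KarpenterNodeRole # assumption: your Karpenter node IAM role
  # Format all local NVMe instance-store volumes into a single RAID0
  # array and mount it for kubelet/pod ephemeral storage use.
  instanceStorePolicy: RAID0
```

With `instanceStorePolicy: RAID0`, nodes come up with the array already formatted and mounted, so Spark pods can consume it immediately via `hostPath` volumes.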
AWS Graviton4 processors deliver up to 30% better compute performance and 75% more memory bandwidth than the previous generation (Graviton3). The NVMe-equipped families used in this guide are based on Graviton2 (c6gd, r6gd, m6gd, i4g, im4gn) and Graviton3 (c7gd, r7gd, m7gd); combined with local NVMe storage, they provide the highest-performance option for shuffle-heavy Spark workloads.
## Architecture: Graviton ARM64 + Direct NVMe SSD Access

Key Benefits:
- 🔥 Maximum Performance: Graviton compute paired with local NVMe SSDs
- 💰 Best Price-Performance: up to 40% cost savings vs x86
- ⚡ No Network I/O for Shuffle: direct local storage access
- 🌱 Sustainable: ARM64 architecture with better energy efficiency
## Graviton Instance Families and NVMe Capacity

| Instance Family | NVMe Storage | Memory Range | vCPU Range | Use Case |
|---|---|---|---|---|
| c6gd | 118GB - 3.8TB | 8GB - 128GB | 2 - 32 | Graviton2 compute-optimized |
| c7gd | 118GB - 7.6TB | 8GB - 192GB | 2 - 48 | Recommended - Graviton3 compute-optimized |
| r6gd | 118GB - 3.8TB | 16GB - 512GB | 2 - 32 | Graviton2 memory-optimized |
| r7gd | 118GB - 7.6TB | 16GB - 768GB | 2 - 48 | Recommended - Graviton3 memory-optimized |
| m6gd | 118GB - 3.8TB | 8GB - 256GB | 2 - 32 | Graviton2 general-purpose |
| m7gd | 118GB - 7.6TB | 8GB - 384GB | 2 - 48 | Recommended - Graviton3 general-purpose |
| i4g | 468GB - 30TB | 12GB - 384GB | 2 - 48 | Graviton2 storage-optimized, maximum NVMe |
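These families are targeted through Karpenter NodePool requirements using well-known labels. A hedged sketch of what a `compute-optimized-graviton` NodePool could look like (the blueprint's actual NodePool may differ; `graviton-nvme` refers to the hypothetical `EC2NodeClass` sketched above):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: compute-optimized-graviton
spec:
  template:
    metadata:
      labels:
        node.kubernetes.io/workload-type: compute-optimized-graviton
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: graviton-nvme # hypothetical, see the sketch above
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c6gd", "c7gd"] # NVMe-equipped compute families
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
```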
## Performance Benchmarks

For detailed Graviton performance benchmarks and comparisons, see 📊 Graviton Spark Benchmarks.
## Example Code
View the complete configuration:
📄 Complete Graviton NVMe Storage Configuration
```yaml
# Pre-requisites before running this job:
# 1/ Open taxi-trip-execute.sh and update $S3_BUCKET and <REGION>
# 2/ Replace $S3_BUCKET with the S3 bucket created by this example (check Terraform outputs)
# 3/ Execute taxi-trip-execute.sh
#
# This example demonstrates Graviton ARM64 with NVMe instance store storage:
# - Direct access to NVMe SSDs attached to ARM64 Graviton instances
# - Maximum performance with local storage - no network I/O overhead
# - Uses Graviton instances with NVMe storage (c6gd, c7gd, r6gd, r7gd, m6gd, m7gd, i4g, im4gn)
# - Karpenter RAID0 policy automatically formats and mounts the NVMe storage
---
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: "taxi-trip-graviton"
  namespace: spark-team-a
  labels:
    app: "taxi-trip-graviton"
    queue: root.test
spec:
  # To create an Ingress object for the Spark driver UI, ensure the Spark Operator
  # Helm chart is deployed with Ingress enabled, then uncomment:
  # sparkUIOptions:
  #   servicePort: 4040
  #   servicePortName: taxi-trip-ui-svc
  #   serviceType: ClusterIP
  #   ingressAnnotations:
  #     kubernetes.io/ingress.class: nginx
  #     nginx.ingress.kubernetes.io/use-regex: "true"
  type: Python
  sparkVersion: "3.5.3"
  mode: cluster
  image: "public.ecr.aws/data-on-eks/spark:3.5.3-scala2.12-java17-python3-ubuntu"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "s3a://$S3_BUCKET/taxi-trip/scripts/pyspark-taxi-trip.py" # Path to the bundled JAR, Python, or R file of the application
  arguments:
    - "s3a://$S3_BUCKET/taxi-trip/input/"
    - "s3a://$S3_BUCKET/taxi-trip/output/"
  sparkConf:
    "spark.app.name": "taxi-trip-graviton"
    "spark.kubernetes.driver.pod.name": "taxi-trip-graviton"
    "spark.kubernetes.executor.podNamePrefix": "taxi-trip-graviton"
    "spark.local.dir": "/data1"
    "spark.speculation": "false"
    "spark.network.timeout": "2400"
    # NVMe storage performance optimizations for Graviton ARM64
    "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
    "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
    "spark.io.compression.codec": "lz4" # Fast compression for NVMe
    "spark.shuffle.compress": "true"
    "spark.shuffle.spill.compress": "true"
    "spark.rdd.compress": "true"
    # Adaptive query execution pairs well with fast local shuffle storage
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"
    # Optimize for high-throughput local storage on ARM64
    "spark.sql.files.maxPartitionBytes": "268435456" # 256MB for NVMe throughput
    "spark.sql.shuffle.partitions": "400" # Tune for the job's parallelism
    "spark.hadoop.fs.s3a.connection.timeout": "1200000"
    "spark.hadoop.fs.s3a.path.style.access": "true"
    "spark.hadoop.fs.s3a.connection.maximum": "200"
    "spark.hadoop.fs.s3a.fast.upload": "true"
    "spark.hadoop.fs.s3a.readahead.range": "256K"
    "spark.hadoop.fs.s3a.input.fadvise": "random"
    "spark.hadoop.fs.s3a.aws.credentials.provider.mapping": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider=software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "software.amazon.awssdk.auth.credentials.ContainerCredentialsProvider" # AWS SDK V2: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/aws_sdk_upgrade.html
    "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    # Spark event logs
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "s3a://$S3_BUCKET/spark-event-logs"
    "spark.eventLog.rolling.enabled": "true"
    "spark.eventLog.rolling.maxFileSize": "64m"
    # "spark.history.fs.eventLog.rolling.maxFilesToRetain": "100"
    # Expose Spark metrics for Prometheus
    "spark.ui.prometheus.enabled": "true"
    "spark.executor.processTreeMetrics.enabled": "true"
    "spark.metrics.conf.*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet"
    "spark.metrics.conf.driver.sink.prometheusServlet.path": "/metrics/driver/prometheus/"
    "spark.metrics.conf.executor.sink.prometheusServlet.path": "/metrics/executors/prometheus/"
    # Graviton ARM64 NVMe instance store configuration:
    # use the NVMe storage mounted by the Karpenter RAID0 policy at /mnt/k8s-disks.
    # This provides maximum performance - local SSD access with no network overhead.
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.type": "Directory"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    initContainers:
      - name: volume-permissions
        image: busybox:1.36
        command: ['sh', '-c', 'chown -R 185 /data1'] # 185 is the spark user in the image
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 2 # Extra headroom for NVMe workload coordination on Graviton
    coreLimit: "2000m"
    memory: "8g" # More memory for large dataset coordination
    memoryOverhead: "2g" # 25% overhead
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-graviton"
      karpenter.sh/capacity-type: "on-demand"
      karpenter.k8s.aws/instance-family: "c6gd"
    # Driver footprint: 2 vCPU, 10GB RAM (needs c6gd.2xlarge or larger; c6gd.xlarge has only 8GB)
  executor:
    initContainers:
      - name: volume-permissions
        image: busybox:1.36
        command: ['sh', '-c', 'chown -R 185 /data1']
        volumeMounts:
          - mountPath: "/data1"
            name: "spark-local-dir-1"
    cores: 4 # Better utilization of Graviton NVMe instance capacity
    coreLimit: "4000m"
    instances: 2
    memory: "15g" # Scaled up for high-throughput work on ARM64
    memoryOverhead: "3g" # 20% overhead for high-throughput shuffle
    serviceAccount: spark-team-a
    labels:
      version: 3.5.3
    nodeSelector:
      node.kubernetes.io/workload-type: "compute-optimized-graviton"
      karpenter.k8s.aws/instance-family: "c6gd"
    # Executor footprint: 4 vCPU, 18GB RAM per executor (needs c6gd.4xlarge or larger)
```
## Graviton NVMe Storage Configuration
Key configuration for ARM64 Graviton with direct NVMe SSD access:
```yaml
sparkConf:
  # ARM64 NVMe performance optimizations
  "spark.shuffle.spill.diskWriteBufferSize": "1048576" # 1MB buffer for NVMe
  "spark.shuffle.file.buffer": "1m" # Larger buffer for local SSD
  "spark.io.compression.codec": "lz4" # Fast compression, efficient on ARM64
  # Direct NVMe SSD access - driver
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"
  # Direct NVMe SSD access - executor
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/k8s-disks"
  "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/data1"

# Graviton NVMe node selection - targets NVMe-equipped instance families
driver:
  nodeSelector:
    node.kubernetes.io/workload-type: "compute-optimized-graviton"
    karpenter.k8s.aws/instance-family: "c6gd" # Ensures NVMe instances
  initContainers:
    - name: volume-permissions
      image: busybox:1.36 # Multi-arch image with ARM64 support
      command: ['sh', '-c', 'chown -R 185 /data1']
executor:
  nodeSelector:
    node.kubernetes.io/workload-type: "compute-optimized-graviton"
    karpenter.k8s.aws/instance-family: "c6gd" # Ensures NVMe instances
```
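Under the hood, Spark on Kubernetes translates each `spark.kubernetes.{driver,executor}.volumes.hostPath.*` property into a `hostPath` volume plus a container mount on the pods it creates. A rough sketch of the resulting executor pod fragment (illustrative, not literal operator output):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: spark-kubernetes-executor
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /data1 # matches spark.local.dir, so shuffle/spill lands here
  volumes:
    - name: spark-local-dir-1
      hostPath:
        path: /mnt/k8s-disks # NVMe RAID0 array mounted by Karpenter
        type: Directory # kubelet refuses to start the pod if this path is missing
```

The `Directory` type acts as a guardrail: if a pod ever lands on a node without the RAID0 mount, it fails fast instead of silently spilling to the root volume.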
Features:
- ARM64 Architecture: Native Graviton optimization
- Auto-RAID0: Karpenter automatically configures RAID0 for multiple NVMe drives
- Latest Generation: c7gd, r7gd, m7gd families with Graviton3 processors
- Zero Network I/O: Direct access to local SSDs
## Deploy and Test

### 1. Verify Existing Graviton NodePools
```bash
# Check the existing Graviton NodePools (these already include NVMe instance families)
# Note: NodePools are cluster-scoped, so no namespace flag is needed
kubectl get nodepools compute-optimized-graviton memory-optimized-graviton

# These NodePools already include:
# - compute-optimized-graviton: c6gd, c7gd (compute + NVMe)
# - memory-optimized-graviton: r6gd, r7gd (memory + NVMe)
```
### 2. Execute Spark Job on Graviton
```bash
cd data-stacks/spark-on-eks/terraform/_local/

# Export the S3 bucket and region from Terraform outputs
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)
export REGION=$(terraform output -raw region)

# Navigate to the examples directory
cd ../../examples/

# Submit the Graviton NVMe storage job
envsubst < nvme-storage-graviton.yaml | kubectl apply -f -

# Monitor node provisioning (should show Graviton instances: c6gd/c7gd with NVMe)
kubectl get nodes -l node.kubernetes.io/workload-type=compute-optimized-graviton --watch

# Monitor job progress
kubectl get sparkapplications -n spark-team-a --watch
```
Expected output:
```text
NAME                 STATUS      ATTEMPTS   START                  FINISH                 AGE
taxi-trip-graviton   COMPLETED   1          2025-09-28T17:03:31Z   2025-09-28T17:08:15Z   4m44s
```
## Performance Comparison

### Expected Performance Characteristics

| Metric | Graviton + NVMe | x86 + NVMe | Improvement (up to) |
|---|---|---|---|
| Price-Performance | Best | Good | 40% better |
| Compute Performance | High | High | 30% better |
| Memory Bandwidth | Very High | High | 75% more |
| Energy Efficiency | Excellent | Good | 60% better |
## Why Choose Graviton for Spark
✅ Superior for:
- Cost-sensitive production workloads
- Large-scale data processing
- Memory-intensive analytics
- Sustainable computing initiatives
✅ Graviton Advantages:
- Up to 40% better price-performance
- Higher memory bandwidth for in-memory processing
- Better energy efficiency
- Native ARM64 ecosystem support
## Cleanup
```bash
# Delete the Spark application
kubectl delete sparkapplication taxi-trip-graviton -n spark-team-a

# NVMe instance storage is wiped automatically when nodes terminate
# Note: the Graviton NodePools are shared and remain in place for other workloads
```
## Next Steps
- 📊 Graviton Performance Benchmarks - Detailed performance analysis
- NVMe Instance Storage (x86) - x86 NVMe comparison
- EBS Dynamic PVC Storage - Production fault tolerance
- Infrastructure Setup - Deploy base infrastructure