Spark on EKS Infrastructure

Deploy a production-ready Apache Spark platform on Amazon EKS with GitOps, auto-scaling, and observability.

Architecture

This stack deploys Spark Operator on EKS with Karpenter for elastic node provisioning and ArgoCD for GitOps-based application management.

Prerequisites

Before deploying, ensure you have the following tools installed and configured (these are the tools used by the commands in this guide):

  • git (to clone the repository)
  • Terraform (to provision the infrastructure)
  • kubectl (to interact with the cluster)
  • AWS CLI with credentials for the target account
  • envsubst (used to render the example job manifests)
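A quick way to confirm the tools are on your PATH (version numbers are not checked here; any recent release should work):

# Verify required CLIs are installed
git --version
terraform version
kubectl version --client
aws --version
envsubst --version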

Step 1: Clone Repository & Navigate

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/data-stacks/spark-on-eks

Step 2: Customize Stack

Edit the stack configuration file to customize addons and settings:

# Edit configuration file
vi terraform/data-stack.tfvars

What Gets Deployed

This stack deploys a complete data platform with 30+ components automatically via GitOps (ArgoCD).

EKS Managed Addons

Deployed and managed by AWS EKS:

| Component | Purpose | Managed By |
|---|---|---|
| coredns | DNS resolution | EKS |
| kube-proxy | Network proxy | EKS |
| vpc-cni | Pod networking with prefix delegation | EKS |
| eks-pod-identity-agent | IAM roles for service accounts | EKS |
| aws-ebs-csi-driver | Persistent block storage | EKS |
| aws-mountpoint-s3-csi-driver | S3 as volumes | EKS |
| metrics-server | Resource metrics API | EKS |
| eks-node-monitoring-agent | Node-level monitoring | EKS |

Core Platform Addons

Infrastructure components deployed via ArgoCD:

| Component | Purpose | Category |
|---|---|---|
| Karpenter | Node autoscaling and bin-packing | Compute |
| ArgoCD | GitOps application deployment | Platform |
| cert-manager | TLS certificate automation | Security |
| external-secrets | AWS Secrets Manager integration | Security |
| ingress-nginx | Ingress controller | Networking |
| aws-load-balancer-controller | ALB/NLB integration | Networking |
| kube-prometheus-stack | Prometheus + Grafana monitoring | Observability |
| aws-for-fluentbit | Log aggregation to CloudWatch | Observability |

Data Platform Addons

Data processing and analytics tools:

| Component | Purpose | Use Case |
|---|---|---|
| spark-operator | Apache Spark on Kubernetes | Batch Processing |
| spark-history-server | Spark job history and metrics | Observability |
| yunikorn | Gang scheduling for batch jobs | Scheduling |
| jupyterhub | Interactive notebooks (Python/Scala) | Data Science |
| flink-operator | Stream processing framework | Real-time Analytics |
| strimzi-kafka | Event streaming platform | Event Streaming |
| trino | Distributed SQL query engine | Data Lakehouse |
| argo-workflows | Workflow orchestration (DAGs) | Orchestration |
| argo-events | Event-driven workflow triggers | Event Processing |
| keda | Event-driven pod autoscaling | Autoscaling |

Optional Addons

Configure in terraform/data-stack.tfvars:

# terraform/data-stack.tfvars
name   = "spark-on-eks"
region = "us-west-2"

# Optional - disable if not needed
enable_ingress_nginx = true # Ingress controller (default: true)
enable_jupyterhub = true # Notebooks (default: true)

# Optional - enable for specific use cases
enable_celeborn = false # Remote shuffle service
enable_datahub = false # Metadata management
enable_superset = false # Data visualization
enable_raydata = false # Distributed ML/AI
enable_amazon_prometheus = false # Managed Prometheus

Customization

To see all available options, check infra/terraform/variables.tf.
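For a quick list of just the variable names, a grep over that file works (assuming the standard Terraform variable "name" { ... } layout):

# List all configurable variable names
grep 'variable "' infra/terraform/variables.tf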

Step 3: Deploy Infrastructure

Run the deployment script:

./deploy.sh
Note: If deployment fails:

  • Rerun the same command: ./deploy.sh
  • If it still fails, debug with the kubectl commands sketched below or raise an issue
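A minimal triage sketch for a failed deploy, using only standard kubectl commands:

# Look for pods that are not Running
kubectl get pods -A | grep -v Running

# Inspect recent cluster events for errors
kubectl get events -A --sort-by=.lastTimestamp | tail -20

# Check which ArgoCD applications are unhealthy
kubectl get applications -n argocd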
Info: Expected deployment time is 15-20 minutes.

Step 4: Verify Deployment

The deployment script automatically configures kubectl. Verify the cluster is ready:

# Set kubeconfig (done automatically by deploy.sh)
export KUBECONFIG=kubeconfig.yaml

# Verify cluster nodes
kubectl get nodes

# Check all namespaces
kubectl get namespaces

# Verify ArgoCD applications
kubectl get applications -n argocd

Quick Verification

Run these commands to verify successful deployment:

# 1. Check nodes are ready
kubectl get nodes
# Expected: 4-5 nodes with STATUS=Ready

# 2. Verify Spark Operator is running
kubectl get pods -n spark-operator
# Expected: spark-operator-controller and webhook pods Running

# 3. Check ArgoCD applications are synced
kubectl get applications -n argocd
# Expected: All apps showing "Synced" and "Healthy"

# 4. Verify Spark CRDs installed
kubectl get crds | grep spark
# Expected: sparkapplications.sparkoperator.k8s.io

# 5. Check Karpenter NodePools ready
kubectl get nodepools
# Expected: 5 pools with READY=True

Expected Output Examples

Nodes:

NAME                                           STATUS   ROLES    AGE     VERSION
ip-100-64-106-144.us-west-2.compute.internal   Ready    <none>   5m44s   v1.33.5-eks-113cf36
ip-100-64-37-76.us-west-2.compute.internal     Ready    <none>   5m43s   v1.33.5-eks-113cf36
...

Spark Operator:

NAME                                         READY   STATUS    RESTARTS   AGE
spark-operator-controller-6bc54d4658-hg2qd   1/1     Running   0          6m20s
spark-operator-webhook-5b5f58597d-hh6b2      1/1     Running   0          6m20s

ArgoCD Applications:

NAME                    SYNC STATUS   HEALTH STATUS
spark-operator          Synced        Healthy
spark-history-server    Synced        Healthy
kube-prometheus-stack   Synced        Healthy
karpenter               Synced        Healthy
cert-manager            Synced        Healthy
...

Karpenter NodePools:

NAME                         NODECLASS   NODES   READY   AGE
general-purpose              default     0       True    13m
compute-optimized-x86        default     0       True    13m
compute-optimized-graviton   default     0       True    13m
memory-optimized-x86         default     0       True    13m
memory-optimized-graviton    default     0       True    13m

Step 5: Access ArgoCD UI

The deployment script displays ArgoCD credentials at the end. Access the UI:

# Port forward ArgoCD server
kubectl port-forward svc/argocd-server -n argocd 8080:443

Open https://localhost:8080 in your browser:

  • Username: admin
  • Password: Displayed at the end of the deploy.sh output (see below if you missed it)
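If the password has scrolled out of your terminal, ArgoCD stores the initial admin password in a well-known secret (assuming the default installation and that the secret has not been deleted):

# Retrieve the initial ArgoCD admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath='{.data.password}' | base64 -d; echo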
Info: All applications should show Synced and Healthy status.

Step 6: Run Test Spark Job

Validate the deployment with a sample PySpark job:

# Get S3 bucket for Spark logs
cd terraform/_local
export S3_BUCKET=$(terraform output -raw s3_bucket_id_spark_history_server)

# Submit test job
cd ../../examples
envsubst < pyspark-pi-job.yaml | kubectl apply -f -

# Watch job status
kubectl get sparkapplications -n spark-team-a -w

What happens (see the manifest sketch after this list):

  1. Karpenter provisions nodes - Takes ~60s to launch compute-optimized instances
  2. Driver pod starts - Coordinates the Spark job execution
  3. Executor pods run - Perform the Pi calculation in parallel
  4. Job completes - Result: Pi is roughly 3.141640
  5. Logs stored in S3 - Accessible via Spark History Server
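For orientation, a minimal SparkApplication manifest in the spirit of pyspark-pi-job.yaml might look like the sketch below. The image, Spark version, and service account are assumptions based on this guide, not the contents of the shipped file:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: pyspark-pi-karpenter
  namespace: spark-team-a
spec:
  type: Python
  mode: cluster
  image: apache/spark:3.5.1                  # assumed image
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.5.1"                      # assumed version
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-team-a             # assumed service account
  executor:
    instances: 2
    cores: 1
    memory: 512m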

Expected output:

NAME                   STATUS      ATTEMPTS   START                  FINISH                 DURATION
pyspark-pi-karpenter   COMPLETED   1          2025-10-21T20:08:52Z   2025-10-21T20:11:49Z   ~3 min

View Job Details
# Watch job status in real-time
kubectl get sparkapplications -n spark-team-a -w

# View driver logs (shows Pi calculation result)
kubectl logs -n spark-team-a pyspark-pi-karpenter-driver

# Check detailed job status
kubectl describe sparkapplication pyspark-pi-karpenter -n spark-team-a

Troubleshooting

Common Issues

Pods stuck in Pending:

# Check node capacity
kubectl describe nodes

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
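If Karpenter is the bottleneck, its NodeClaims usually show why a node was not launched (this assumes the Karpenter v1 CRDs, matching the NodePools deployed by this stack):

# Inspect Karpenter's view of requested capacity
kubectl get nodeclaims

# Describe one claim for launch errors (replace <name> with a claim from the list)
kubectl describe nodeclaim <name>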

ArgoCD applications not syncing:

# Check ArgoCD application status
kubectl get applications -n argocd

# Check specific application
kubectl describe application spark-operator -n argocd
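If an application is stuck, you can trigger a manual sync. The sketch below assumes the argocd CLI is installed and logged in (the port-forward from Step 5 works as the server address):

# Manually sync a stuck application
argocd app sync spark-operator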

Next Steps

With the infrastructure deployed, you can run the Spark examples in the repository's examples/ directory; the submission pattern below generalizes from Step 6.
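A sketch of reusing the Step 6 pattern for other manifests (the file name is a placeholder, not a shipped example):

# See which example manifests are available
ls examples/

# Render and submit one, reusing the S3_BUCKET variable from Step 6
envsubst < examples/<example-job>.yaml | kubectl apply -f -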

Cleanup

To remove all resources, use the dedicated cleanup script:

# Navigate to stack directory
cd data-on-eks/data-stacks/spark-on-eks

# Run cleanup script
./cleanup.sh
Note: If cleanup fails:

  • Rerun the same command: ./cleanup.sh
  • Keep rerunning until all resources are deleted
  • Some AWS resources have dependencies that require multiple cleanup passes; you can check for leftovers with the commands below
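Two spot checks for leftovers (the region is this guide's default; adjust to match your tfvars):

# Confirm the cluster is gone
aws eks list-clusters --region us-west-2

# Look for orphaned load balancers created by the ingress/LB controllers
aws elbv2 describe-load-balancers --region us-west-2 \
  --query 'LoadBalancers[].LoadBalancerName'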
Warning: This command deletes all resources and data. Back up any important data first.