EMR on EKS Infrastructure Deployment
Deploy a production-ready Amazon EMR on EKS cluster with virtual clusters, IAM roles, and Karpenter autoscaling.
Architecture Overview
The stack provisions a multi-AZ VPC and an EKS cluster with Karpenter-managed node pools and the YuniKorn batch scheduler, plus two EMR on EKS virtual clusters (Team A and Team B) mapped to dedicated namespaces, IAM job execution roles with IRSA service accounts, an S3 bucket for Spark logs, and CloudWatch log groups.
Prerequisites
- AWS CLI configured with appropriate credentials
- kubectl installed
- Terraform >= 1.0
- jq (for parsing JSON outputs)
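A quick sanity check that the prerequisites are in place before deploying (exact versions will vary by environment):
# Confirm AWS credentials work and the required tools are installed
aws sts get-caller-identity
kubectl version --client
terraform version
jq --version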
Deployment Steps
1. Clone the Repository
git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks
2. Navigate to EMR on EKS Stack
cd data-stacks/emr-on-eks
3. Review Configuration
Edit terraform/data-stack.tfvars to customize your deployment:
# EMR on EKS Data Stack Configuration
name = "emr-on-eks"
region = "us-west-2"
deployment_id = "your-unique-id"
# Enable EMR on EKS Virtual Clusters
enable_emr_on_eks = true
# Optional: Enable additional addons
enable_ingress_nginx = true
enable_ipv6 = false
4. Deploy Infrastructure
./deploy.sh
This script will:
- Initialize Terraform
- Create the VPC and networking (if they do not already exist)
- Deploy EKS cluster with managed node groups
- Install Karpenter for autoscaling
- Install YuniKorn scheduler
- Create EMR virtual clusters for Team A and Team B
- Configure IAM roles and service accounts
- Set up S3 buckets for logs and data
Initial deployment takes approximately 30-40 minutes. Subsequent updates are faster.
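If you prefer to drive Terraform directly, the wrapper is roughly equivalent to the sketch below. This is illustrative only, with paths assumed from steps 3 and 5 (the tfvars file at terraform/data-stack.tfvars and the working directory terraform/_local); the script may do more, so deploy.sh remains the supported entry point.
# Illustrative sketch of what deploy.sh automates
terraform -chdir=terraform/_local init
terraform -chdir=terraform/_local apply -var-file=../data-stack.tfvars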
5. View Terraform Outputs
After deployment completes, view the infrastructure details:
cd terraform/_local
terraform output
You should see output similar to:
cluster_arn = "arn:aws:eks:us-west-2:123456789:cluster/emr-on-eks"
cluster_name = "emr-on-eks"
configure_kubectl = "aws eks --region us-west-2 update-kubeconfig --name emr-on-eks"
deployment_id = "abcdefg"
emr_on_eks = {
  "emr-data-team-a" = {
    "cloudwatch_log_group_name" = "/emr-on-eks-logs/emr-on-eks/emr-data-team-a"
    "job_execution_role_arn" = "arn:aws:iam::123456789:role/emr-on-eks-emr-data-team-a"
    "virtual_cluster_id" = "hclg71zute4fm4fpm3m2cobv0"
  }
  "emr-data-team-b" = {
    "cloudwatch_log_group_name" = "/emr-on-eks-logs/emr-on-eks/emr-data-team-b"
    "job_execution_role_arn" = "arn:aws:iam::123456789:role/emr-on-eks-emr-data-team-b"
    "virtual_cluster_id" = "cqt781jwn4vq1wh4jlqdhpj5h"
  }
}
emr_s3_bucket_name = "emr-on-eks-spark-logs-123456789"
region = "us-west-2"
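Later examples reference environment variables such as $EMR_VIRTUAL_CLUSTER_ID_TEAM_A and $CLOUDWATCH_LOG_GROUP_TEAM_A. One way to populate them from the outputs above using jq (a listed prerequisite); the output names match the sample shown here, and $EMR_JOB_EXECUTION_ROLE_ARN_TEAM_A is a name introduced for the job-submission example further below:
# Run from terraform/_local after the deployment completes
export EMR_VIRTUAL_CLUSTER_ID_TEAM_A=$(terraform output -json emr_on_eks | jq -r '."emr-data-team-a".virtual_cluster_id')
export EMR_JOB_EXECUTION_ROLE_ARN_TEAM_A=$(terraform output -json emr_on_eks | jq -r '."emr-data-team-a".job_execution_role_arn')
export CLOUDWATCH_LOG_GROUP_TEAM_A=$(terraform output -json emr_on_eks | jq -r '."emr-data-team-a".cloudwatch_log_group_name')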
6. Configure kubectl Access
Use the configure_kubectl output to authenticate with the cluster:
# Run the command from terraform output
aws eks --region us-west-2 update-kubeconfig --name emr-on-eks
# Verify cluster access
kubectl get nodes
# Check EMR namespaces
kubectl get namespaces | grep emr-data
# Expected output:
# emr-data-team-a Active 5m
# emr-data-team-b Active 5m
7. Verify Karpenter and YuniKorn
# Check Karpenter nodepools
kubectl get nodepool
# Check YuniKorn scheduler
kubectl get pods -n yunikorn-system
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=20
What Gets Deployed
EKS Cluster Components
| Component | Description |
|---|---|
| VPC | Multi-AZ VPC with public and private subnets |
| EKS Cluster | Kubernetes 1.31+ with managed control plane |
| Karpenter | Node autoscaling with multiple node pools |
| YuniKorn | Advanced Kubernetes scheduler for batch workloads |
| EBS CSI Driver | For dynamic EBS volume provisioning |
| AWS Load Balancer Controller | For ingress and service load balancing |
| Fluent Bit | Log forwarding to CloudWatch |
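A quick way to spot-check that these components are running; the pod and namespace names below are assumptions based on the default Helm releases and may differ in your deployment:
# All add-on pods should report Running
kubectl get pods -A | grep -E 'karpenter|yunikorn|aws-load-balancer|fluent|ebs-csi'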
EMR on EKS Resources
| Resource | Description |
|---|---|
| Virtual Cluster (Team A) | EMR virtual cluster in emr-data-team-a namespace |
| Virtual Cluster (Team B) | EMR virtual cluster in emr-data-team-b namespace |
| IAM Roles | Job execution roles with S3, Glue, and CloudWatch permissions |
| Service Accounts | Kubernetes service accounts with IRSA |
| S3 Bucket | For Spark logs, shuffle data, and results |
| CloudWatch Log Groups | For EMR job logs |
Karpenter Node Pools
| Node Pool | Instance Types | Use Case |
|---|---|---|
| Compute Optimized (Graviton) | c6g, c7g, c8g | General Spark workloads |
| Memory Optimized (Graviton) | r6g, r7g, r8g | Memory-intensive jobs |
| NVMe SSD (Graviton) | c6gd, c7gd, m6gd, r6gd | High I/O shuffle operations |
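To see which node pool and instance type Karpenter picked for each node, show the standard Karpenter node labels as extra columns (label names assume a Karpenter v1beta1 or later install, consistent with the kubectl get nodepool command used earlier):
# Show the owning node pool and instance type for every node
kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type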
Configuration Options
Enable Additional Features
Edit terraform/data-stack.tfvars to enable optional features:
# Enable Spark History Server
enable_spark_history_server = true
# Enable JupyterHub
enable_jupyterhub = true
# Enable Prometheus & Grafana
enable_kube_prometheus_stack = true
# Enable Argo Workflows
enable_argo_workflows = true
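After editing the tfvars file, re-run the deployment. Assuming deploy.sh wraps terraform apply as in step 4, only the incremental changes are applied:
# Apply the updated configuration
./deploy.sh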
Customize Node Pools
To modify Karpenter node pools, create an overlay file:
# Create custom node pool configuration
cp infra/terraform/manifests/karpenter/nodepool-compute-optimized-graviton.yaml \
terraform/manifests/karpenter/nodepool-compute-optimized-graviton.yaml
# Edit the file to customize instance types, limits, etc.
Customize EMR Virtual Clusters
To add more virtual clusters or modify existing ones, edit terraform/emr-on-eks.tf:
# Add a new virtual cluster
module "emr_on_eks_team_c" {
source = "../../infra/terraform/modules/emr-on-eks"
cluster_name = var.cluster_name
namespace = "emr-data-team-c"
virtual_cluster_name = "emr-data-team-c"
# Additional configuration...
}
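After adding the module, re-apply the stack and confirm the new virtual cluster was registered (a sketch that assumes the cluster is named emr-data-team-c as above):
# Re-apply, then look for the new virtual cluster
./deploy.sh
aws emr-containers list-virtual-clusters --region us-west-2 \
  --query 'virtualClusters[?name==`emr-data-team-c`].{id:id,state:state}'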
Accessing the Cluster
Using kubectl
# Only needed if you exported the kubeconfig to a local file; the update-kubeconfig command in step 6 writes to ~/.kube/config by default
export KUBECONFIG=kubeconfig.yaml
kubectl get pods -n emr-data-team-a
Using AWS CLI
# List EMR virtual clusters
aws emr-containers list-virtual-clusters --region us-west-2
# Describe a virtual cluster
aws emr-containers describe-virtual-cluster \
--id $EMR_VIRTUAL_CLUSTER_ID_TEAM_A \
--region us-west-2
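To verify the stack end to end, you can submit a small test job with start-job-run. This is a minimal sketch that runs the SparkPi example bundled with the EMR Spark image and uses the variables exported in step 5; the release label is an assumption, so substitute one available in your region:
# Submit a SparkPi smoke-test job to the Team A virtual cluster
aws emr-containers start-job-run \
  --region us-west-2 \
  --virtual-cluster-id "$EMR_VIRTUAL_CLUSTER_ID_TEAM_A" \
  --name pi-smoke-test \
  --execution-role-arn "$EMR_JOB_EXECUTION_ROLE_ARN_TEAM_A" \
  --release-label emr-7.2.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=1"
    }
  }'
# Watch the job's state
aws emr-containers list-job-runs \
  --virtual-cluster-id "$EMR_VIRTUAL_CLUSTER_ID_TEAM_A" \
  --region us-west-2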
Monitoring and Observability
CloudWatch Logs
EMR job logs are automatically forwarded to CloudWatch:
# View logs in CloudWatch
aws logs tail $CLOUDWATCH_LOG_GROUP_TEAM_A --follow
Spark History Server
If enabled, access the Spark History Server:
# Get the Spark History Server URL
kubectl get ingress -n spark-history-server
Prometheus & Grafana
If enabled, access Grafana dashboards:
# Port forward to Grafana
kubectl port-forward -n kube-prometheus-stack \
svc/kube-prometheus-stack-grafana 3000:80
# Access at http://localhost:3000
# Default credentials: admin / prom-operator
Troubleshooting
Check Karpenter Logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100
Check YuniKorn Scheduler
kubectl logs -n yunikorn-system -l app=yunikorn --tail=100
Verify IAM Roles
# Check service account annotations
# List the service accounts in the team namespace and check their IRSA role annotations
kubectl get sa -n emr-data-team-a
kubectl get sa -n emr-data-team-a -o yaml | grep eks.amazonaws.com/role-arn
Check Node Provisioning
# List node claims
kubectl get nodeclaims
# Describe a node claim
kubectl describe nodeclaim <nodeclaim-name>
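If Spark pods stay Pending and no node claim appears, the pending pods' events usually explain why Karpenter could not provision capacity (Team A namespace shown as an example):
# Find pending pods in the team namespace
kubectl get pods -n emr-data-team-a --field-selector=status.phase=Pending
# The Events section explains scheduling and provisioning failures
kubectl describe pod <pending-pod-name> -n emr-data-team-a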
Cleanup
To destroy all resources:
./cleanup.sh
Ensure all EMR jobs have finished or been cancelled before running cleanup. The teardown takes approximately 20-30 minutes to complete.
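Before tearing down, you can check for (and cancel) jobs that are still running; the example uses the Team A variable exported in step 5 and should be repeated for Team B:
# List jobs still running on the Team A virtual cluster
aws emr-containers list-job-runs \
  --virtual-cluster-id "$EMR_VIRTUAL_CLUSTER_ID_TEAM_A" \
  --states RUNNING --region us-west-2
# Cancel a job if necessary
aws emr-containers cancel-job-run \
  --id <job-run-id> \
  --virtual-cluster-id "$EMR_VIRTUAL_CLUSTER_ID_TEAM_A" \
  --region us-west-2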
Next Steps
- EBS Hostpath Storage Example - Cost-effective shared node storage
- EBS PVC Storage Example - Dynamic volume provisioning
- NVMe SSD Storage Example - Maximum I/O performance