
EMR on EKS Infrastructure Deployment

Deploy a production-ready Amazon EMR on EKS cluster with virtual clusters, IAM roles, and Karpenter autoscaling.

Architecture Overview

Prerequisites

  • AWS CLI configured with appropriate credentials
  • kubectl installed
  • Terraform >= 1.0
  • jq (for parsing JSON outputs)
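
A quick way to confirm these tools are installed before deploying (using each CLI's standard version flag):

# Verify required tooling
aws --version
kubectl version --client
terraform version
jq --version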

Deployment Steps

1. Clone the Repository

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks

2. Navigate to EMR on EKS Stack

cd data-stacks/emr-on-eks

3. Review Configuration

Edit terraform/data-stack.tfvars to customize your deployment:

# EMR on EKS Data Stack Configuration
name = "emr-on-eks"
region = "us-west-2"
deployment_id = "your-unique-id"

# Enable EMR on EKS Virtual Clusters
enable_emr_on_eks = true

# Optional: Enable additional addons
enable_ingress_nginx = true
enable_ipv6 = false

4. Deploy Infrastructure

./deploy.sh

This script will:

  1. Initialize Terraform
  2. Create the VPC and networking (if they do not already exist)
  3. Deploy EKS cluster with managed node groups
  4. Install Karpenter for autoscaling
  5. Install YuniKorn scheduler
  6. Create EMR virtual clusters for Team A and Team B
  7. Configure IAM roles and service accounts
  8. Set up S3 buckets for logs and data

Deployment Time

Initial deployment takes approximately 30-40 minutes. Subsequent updates are faster.

5. View Terraform Outputs

After deployment completes, view the infrastructure details:

cd terraform/_local
terraform output

You should see output similar to:

cluster_arn = "arn:aws:eks:us-west-2:123456789:cluster/emr-on-eks"
cluster_name = "emr-on-eks"
configure_kubectl = "aws eks --region us-west-2 update-kubeconfig --name emr-on-eks"
deployment_id = "abcdefg"
emr_on_eks = {
  "emr-data-team-a" = {
    "cloudwatch_log_group_name" = "/emr-on-eks-logs/emr-on-eks/emr-data-team-a"
    "job_execution_role_arn" = "arn:aws:iam::123456789:role/emr-on-eks-emr-data-team-a"
    "virtual_cluster_id" = "hclg71zute4fm4fpm3m2cobv0"
  }
  "emr-data-team-b" = {
    "cloudwatch_log_group_name" = "/emr-on-eks-logs/emr-on-eks/emr-data-team-b"
    "job_execution_role_arn" = "arn:aws:iam::123456789:role/emr-on-eks-emr-data-team-b"
    "virtual_cluster_id" = "cqt781jwn4vq1wh4jlqdhpj5h"
  }
}
emr_s3_bucket_name = "emr-on-eks-spark-logs-123456789"
region = "us-west-2"
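
Later commands in this guide reference values such as $EMR_VIRTUAL_CLUSTER_ID_TEAM_A and $CLOUDWATCH_LOG_GROUP_TEAM_A, so it helps to export them now. A minimal sketch using jq (listed in the prerequisites) to parse the Terraform outputs, run from terraform/_local; the $EMR_JOB_ROLE_ARN_TEAM_A name is only a suggestion used in later examples:

# Export commonly used outputs as environment variables
export EMR_VIRTUAL_CLUSTER_ID_TEAM_A=$(terraform output -json emr_on_eks | jq -r '."emr-data-team-a".virtual_cluster_id')
export EMR_JOB_ROLE_ARN_TEAM_A=$(terraform output -json emr_on_eks | jq -r '."emr-data-team-a".job_execution_role_arn')
export CLOUDWATCH_LOG_GROUP_TEAM_A=$(terraform output -json emr_on_eks | jq -r '."emr-data-team-a".cloudwatch_log_group_name')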

6. Configure kubectl Access

Use the configure_kubectl output to authenticate with the cluster:

# Run the command from terraform output
aws eks --region us-west-2 update-kubeconfig --name emr-on-eks

# Verify cluster access
kubectl get nodes

# Check EMR namespaces
kubectl get namespaces | grep emr-data

# Expected output:
# emr-data-team-a Active 5m
# emr-data-team-b Active 5m

7. Verify Karpenter and YuniKorn

# Check Karpenter nodepools
kubectl get nodepool

# Check YuniKorn scheduler
kubectl get pods -n yunikorn-system

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=20

What Gets Deployed

EKS Cluster Components

| Component | Description |
| --- | --- |
| VPC | Multi-AZ VPC with public and private subnets |
| EKS Cluster | Kubernetes 1.31+ with managed control plane |
| Karpenter | Node autoscaling with multiple node pools |
| YuniKorn | Advanced Kubernetes scheduler for batch workloads |
| EBS CSI Driver | For dynamic EBS volume provisioning |
| AWS Load Balancer Controller | For ingress and service load balancing |
| Fluent Bit | Log forwarding to CloudWatch |
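
To spot-check the components that step 7 does not cover, list their workloads; the deployment and daemonset names below are the usual chart and addon defaults (an assumption if your configuration renames them):

# Spot-check addons not covered in step 7 (names/namespaces are the usual defaults)
kubectl get deployment -n kube-system aws-load-balancer-controller
kubectl get daemonset -n kube-system ebs-csi-node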

EMR on EKS Resources

| Resource | Description |
| --- | --- |
| Virtual Cluster (Team A) | EMR virtual cluster in the emr-data-team-a namespace |
| Virtual Cluster (Team B) | EMR virtual cluster in the emr-data-team-b namespace |
| IAM Roles | Job execution roles with S3, Glue, and CloudWatch permissions |
| Service Accounts | Kubernetes service accounts with IRSA |
| S3 Bucket | For Spark logs, shuffle data, and results |
| CloudWatch Log Groups | For EMR job logs |

Karpenter Node Pools

| Node Pool | Instance Types | Use Case |
| --- | --- | --- |
| Compute Optimized (Graviton) | c6g, c7g, c8g | General Spark workloads |
| Memory Optimized (Graviton) | r6g, r7g, r8g | Memory-intensive jobs |
| NVMe SSD (Graviton) | c6gd, c7gd, m6gd, r6gd | High I/O shuffle operations |
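
Once nodes are provisioned, the standard karpenter.sh/nodepool node label shows which pool each node came from:

# Show the provisioning node pool for every node
kubectl get nodes -L karpenter.sh/nodepool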

Configuration Options

Enable Additional Features

Edit terraform/data-stack.tfvars to enable optional features:

# Enable Spark History Server
enable_spark_history_server = true

# Enable JupyterHub
enable_jupyterhub = true

# Enable Prometheus & Grafana
enable_kube_prometheus_stack = true

# Enable Argo Workflows
enable_argo_workflows = true
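
After changing these flags, re-run the deploy script; the assumption here is that deploy.sh simply re-applies the Terraform stack, so a subsequent run only adds the newly enabled addons:

# Re-apply the stack with the updated feature flags
./deploy.sh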

Customize Node Pools

To modify Karpenter node pools, create an overlay file:

# Create custom node pool configuration
cp infra/terraform/manifests/karpenter/nodepool-compute-optimized-graviton.yaml \
  terraform/manifests/karpenter/nodepool-compute-optimized-graviton.yaml

# Edit the file to customize instance types, limits, etc.

Customize EMR Virtual Clusters

To add more virtual clusters or modify existing ones, edit terraform/emr-on-eks.tf:

# Add a new virtual cluster
module "emr_on_eks_team_c" {
source = "../../infra/terraform/modules/emr-on-eks"

cluster_name = var.cluster_name
namespace = "emr-data-team-c"
virtual_cluster_name = "emr-data-team-c"

# Additional configuration...
}

Accessing the Cluster

Using kubectl

export KUBECONFIG=kubeconfig.yaml
kubectl get pods -n emr-data-team-a

Using AWS CLI

# List EMR virtual clusters
aws emr-containers list-virtual-clusters --region us-west-2

# Describe a virtual cluster
aws emr-containers describe-virtual-cluster \
--id $EMR_VIRTUAL_CLUSTER_ID_TEAM_A \
--region us-west-2
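
As a quick smoke test of a virtual cluster and its job execution role, a small Spark job can be submitted with start-job-run. The sketch below reuses the environment variables exported earlier and the SparkPi example bundled in the EMR image; the release label is an assumption, so use one supported in your region.

# Submit a sample SparkPi job to Team A's virtual cluster (release label is an assumption)
aws emr-containers start-job-run \
  --virtual-cluster-id $EMR_VIRTUAL_CLUSTER_ID_TEAM_A \
  --name sample-spark-pi \
  --execution-role-arn $EMR_JOB_ROLE_ARN_TEAM_A \
  --release-label emr-7.2.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=1"
    }
  }' \
  --region us-west-2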

Monitoring and Observability

CloudWatch Logs

EMR job logs are automatically forwarded to CloudWatch:

# View logs in CloudWatch
aws logs tail $CLOUDWATCH_LOG_GROUP_TEAM_A --follow
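
To confirm the log groups exist before tailing them, list them by the prefix shown in the Terraform outputs:

# List the CloudWatch log groups created for EMR on EKS
aws logs describe-log-groups --log-group-name-prefix /emr-on-eks-logs --region us-west-2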

Spark History Server

If enabled, access the Spark History Server:

# Get the Spark History Server URL
kubectl get ingress -n spark-history-server
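
If no ingress address is returned, a port-forward is a workable fallback; the service name and port vary by chart, so check them first (18080 below is just the conventional Spark History Server UI port):

# Find the service, then port-forward to it
kubectl get svc -n spark-history-server
kubectl port-forward -n spark-history-server svc/<service-name> 18080:<service-port>

# Access at http://localhost:18080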

Prometheus & Grafana

If enabled, access Grafana dashboards:

# Port forward to Grafana
kubectl port-forward -n kube-prometheus-stack \
svc/kube-prometheus-stack-grafana 3000:80

# Access at http://localhost:3000
# Default credentials: admin / prom-operator

Troubleshooting

Check Karpenter Logs

kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

Check YuniKorn Scheduler

kubectl logs -n yunikorn-system -l app=yunikorn --tail=100

Verify IAM Roles

# List EMR-managed service accounts, then inspect one (kubectl does not expand wildcards)
kubectl get sa -n emr-data-team-a
# Check the IRSA role annotation
kubectl describe sa <service-account-name> -n emr-data-team-a
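
The job execution role itself can also be checked, using the role name from the Terraform outputs:

# Inspect the job execution role's IRSA trust policy (role name taken from the example output above)
aws iam get-role \
  --role-name emr-on-eks-emr-data-team-a \
  --query 'Role.AssumeRolePolicyDocument'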

Check Node Provisioning

# List node claims
kubectl get nodeclaims

# Describe a node claim
kubectl describe nodeclaim <nodeclaim-name>
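
If node claims stay pending, recent events in the job namespace usually explain why pods cannot be scheduled:

# Recent scheduling and provisioning events for Team A's namespace
kubectl get events -n emr-data-team-a --sort-by=.lastTimestamp | tail -20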

Cleanup

To destroy all resources:

./cleanup.sh

Cleanup Time

Cleanup takes approximately 20-30 minutes to complete. Ensure all EMR jobs are terminated before cleanup.
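
To check for active jobs before tearing down (reusing the environment variables exported earlier), list and, if needed, cancel job runs:

# Check for active job runs before cleanup
aws emr-containers list-job-runs \
  --virtual-cluster-id $EMR_VIRTUAL_CLUSTER_ID_TEAM_A \
  --states RUNNING \
  --region us-west-2

# Cancel a job run that is still active
aws emr-containers cancel-job-run \
  --id <job-run-id> \
  --virtual-cluster-id $EMR_VIRTUAL_CLUSTER_ID_TEAM_A \
  --region us-west-2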

Next Steps

Additional Resources