Ray Data on EKS - Infrastructure Deployment

Deploy production-ready Ray infrastructure on Amazon EKS for distributed computing, machine learning, and data processing workloads.

Architecture Overview

The deployment provisions a complete Ray platform on EKS with automatic scaling, monitoring, and GitOps-based management.

Components

| Layer | Components | Purpose |
|---|---|---|
| AWS Infrastructure | VPC (dual CIDR), EKS v1.31+, S3, KMS | Network isolation, managed Kubernetes, storage, encryption |
| Platform | Karpenter, ArgoCD, Prometheus Stack | Node autoscaling, GitOps deployment, monitoring |
| Security | Pod Identity, cert-manager, external-secrets | IAM authentication, TLS, secrets management |
| Ray | KubeRay Operator v1.4.2, RayCluster/Job/Service CRDs | Cluster management, job orchestration |

Network Design

VPC 10.0.0.0/16
├── Public Subnets (3 AZs) → NAT Gateways, Load Balancers
├── Private Subnets (3 AZs) → EKS control plane, Ray workloads
└── Secondary 100.64.0.0/16 → Karpenter-managed node pools
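
After deployment, you can confirm that both CIDR blocks are attached to the VPC with the AWS CLI (the Name tag below assumes the default cluster name, ray-on-eks):

# List the CIDR blocks associated with the cluster VPC
aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=ray-on-eks" \
  --query "Vpcs[].CidrBlockAssociationSet[].CidrBlock" \
  --output table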

Prerequisites

| Requirement | Version | Purpose |
|---|---|---|
| AWS CLI | v2.x | AWS resource management |
| Terraform | ≥ 1.0 | Infrastructure as Code |
| kubectl | ≥ 1.28 | Kubernetes management |
| Git | Latest | Repository cloning |
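
A quick check that the required tools are installed and meet the minimum versions:

aws --version
terraform version
kubectl version --client
git --version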

IAM Permissions Required:

  • EKS: Cluster management
  • EC2: VPC, Security Groups, Instances
  • IAM: Roles, Policies, Pod Identity
  • S3: Bucket operations
  • CloudWatch: Logs and metrics
  • KMS: Encryption keys
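
Before deploying, confirm the AWS CLI is authenticated as a principal in the intended account (this checks identity and region only, not individual policies):

# Show the account and IAM principal the CLI will use
aws sts get-caller-identity

# Confirm the target region is configured
aws configure get region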

Deployment

1. Clone Repository

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/data-stacks/ray-on-eks

2. Configure Stack

Edit terraform/data-stack.tfvars:

name                     = "ray-on-eks"   # Unique cluster name
region                   = "us-west-2"    # AWS region
enable_raydata           = true           # Deploy KubeRay operator
enable_ingress_nginx     = true           # Ray Dashboard access (optional)
enable_jupyterhub        = false          # Interactive notebooks (optional)
enable_amazon_prometheus = false          # AWS Managed Prometheus (optional)

3. Deploy Infrastructure

./deploy.sh

Deployment Process (~25-30 minutes):

  1. ✅ Provision VPC, EKS cluster, S3 bucket
  2. ✅ Configure Karpenter for node autoscaling
  3. ✅ Deploy ArgoCD for GitOps
  4. ✅ Install KubeRay operator and platform addons
  5. ✅ Configure kubectl access
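
deploy.sh wraps the Terraform workflow. If you prefer to drive Terraform yourself, a minimal sketch, assuming the stack keeps its configuration under terraform/ and reads data-stack.tfvars:

cd terraform
terraform init
terraform plan -var-file=data-stack.tfvars
terraform apply -var-file=data-stack.tfvars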

4. Verify Deployment

source set-env.sh

# Verify cluster
kubectl get nodes

# Check KubeRay operator
kubectl get pods -n kuberay-operator

# Verify ArgoCD applications
kubectl get applications -n argocd | grep -E "kuberay|karpenter"

Expected Output:

NAME               READY   STATUS    AGE
kuberay-operator   1/1     Running   5m

Access Dashboards

Ray Dashboard

# After deploying a RayCluster (see examples below)
kubectl port-forward -n raydata service/<raycluster-name>-head-svc 8265:8265

# Access: http://localhost:8265

ArgoCD UI

# Get password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Port forward
kubectl port-forward -n argocd svc/argocd-server 8080:443

# Access: https://localhost:8080 (admin / <password>)

Grafana Monitoring

# Get password
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana \
-o jsonpath="{.data.admin-password}" | base64 -d

# Port forward
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80

# Access: http://localhost:3000 (admin / <password>)

Pod Identity for AWS Access

The infrastructure automatically configures EKS Pod Identity for Ray workloads to securely access AWS services.

Automatic Configuration

IAM Role (raydata namespace):

  • S3 Access: Read/write to data buckets and Iceberg warehouse
  • AWS Glue: Create and manage Iceberg catalog tables
  • CloudWatch: Write logs and metrics
  • S3 Tables: Access S3 Tables API

Configuration Location: /infra/terraform/teams.tf and /infra/terraform/ray-operator.tf
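
Pod Identity associations are visible through the AWS CLI. To confirm the raydata association exists after deployment (assuming the cluster name from data-stack.tfvars, e.g. ray-on-eks):

# List Pod Identity associations scoped to the raydata namespace
aws eks list-pod-identity-associations --cluster-name ray-on-eks --namespace raydata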

Usage in Ray Jobs

Enable Pod Identity by specifying the service account:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-ray-job
  namespace: raydata
spec:
  submitterPodTemplate:
    spec:
      serviceAccountName: raydata  # ← Enables Pod Identity

  rayClusterSpec:
    headGroupSpec:
      template:
        spec:
          serviceAccountName: raydata  # ← Enables Pod Identity

    workerGroupSpecs:
      - template:
          spec:
            serviceAccountName: raydata  # ← Enables Pod Identity

Native S3 Access

PyArrow 21.0.0+ automatically uses Pod Identity credentials via AWS_CONTAINER_CREDENTIALS_FULL_URI. No boto3 configuration or credential helpers needed.
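
A quick way to smoke-test this from inside a running Ray pod (the pod and bucket names below are placeholders; the region should match your tfvars):

# List the bucket using PyArrow's S3 filesystem with Pod Identity credentials only
kubectl exec -n raydata <ray-head-pod> -- python -c "import pyarrow.fs as pafs; fs = pafs.S3FileSystem(region='us-west-2'); print(fs.get_file_info(pafs.FileSelector('<your-bucket>', recursive=False)))"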

Example: Basic Ray Cluster

Deploy a Ray cluster with autoscaling:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-basic
  namespace: raydata
spec:
  rayVersion: '2.47.1'
  enableInTreeAutoscaling: true

  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        serviceAccountName: raydata
        containers:
          - name: ray-head
            image: rayproject/ray:2.47.1-py310
            resources:
              requests: {cpu: "1", memory: "2Gi"}
              limits: {cpu: "2", memory: "4Gi"}

  workerGroupSpecs:
    - replicas: 2
      minReplicas: 1
      maxReplicas: 10
      groupName: workers
      template:
        spec:
          serviceAccountName: raydata
          containers:
            - name: ray-worker
              image: rayproject/ray:2.47.1-py310
              resources:
                requests: {cpu: "2", memory: "4Gi"}
                limits: {cpu: "4", memory: "8Gi"}

Deploy:

kubectl apply -f raycluster-basic.yaml

# Verify
kubectl get rayclusters -n raydata
kubectl get pods -n raydata
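
With the cluster running, you can submit a short test job through the dashboard port-forward. A minimal sketch, assuming the Ray CLI is installed locally (e.g. pip install "ray[default]") at a version matching the cluster:

# Forward the head service's dashboard port (service name follows <raycluster-name>-head-svc)
kubectl port-forward -n raydata service/raycluster-basic-head-svc 8265:8265 &

# Submit a tiny job that prints the cluster's aggregate resources
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"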

Advanced Configuration

Stack Addons

Configure optional components in terraform/data-stack.tfvars:

| Addon | Default | Purpose |
|---|---|---|
| enable_raydata | true | KubeRay operator |
| enable_ingress_nginx | false | Public dashboard access |
| enable_jupyterhub | false | Interactive notebooks |
| enable_amazon_prometheus | false | AWS Managed Prometheus |

Karpenter Autoscaling

Karpenter automatically provisions nodes for Ray workloads. Configure NodePools for specific instance types:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ray-compute
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.2xlarge", "m5.4xlarge", "c5.4xlarge"]
  limits:
    cpu: "1000"
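
After applying a NodePool, you can watch what Karpenter provisions against it:

# Inspect the NodePool and its usage against the limits
kubectl get nodepools
kubectl describe nodepool ray-compute

# Watch NodeClaims as Karpenter launches capacity
kubectl get nodeclaims -w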

Troubleshooting

Operator Issues

# Check operator status
kubectl get pods -n kuberay-operator
kubectl logs -n kuberay-operator deployment/kuberay-operator --tail=50

# Verify ArgoCD application
kubectl get application kuberay-operator -n argocd

Common Issues:

| Issue | Cause | Solution |
|---|---|---|
| ImagePullBackOff | Network connectivity | Check VPC NAT gateway, security groups |
| CrashLoopBackOff | Configuration error | Review operator logs for details |
| Pending pods | Insufficient capacity | Check Karpenter logs, node limits |

Ray Cluster Issues

# Describe cluster
kubectl describe raycluster <name> -n raydata

# Check pod logs
kubectl logs <ray-head-pod> -n raydata

# Verify Karpenter provisioning
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

Cost Optimization

Recommendations

  1. Spot Instances: Karpenter uses spot by default (60-90% savings)
  2. Right-Sizing: Start with smaller instances, scale based on metrics
  3. Autoscaling: Enable Ray autoscaling to match workload demand
  4. Node Consolidation: Karpenter automatically consolidates underutilized nodes

Monitor Costs

# View provisioned instances
kubectl get nodes -l karpenter.sh/capacity-type

# Check instance types
kubectl get nodes -L node.kubernetes.io/instance-type

Cleanup

# Delete Ray clusters
kubectl delete rayclusters --all -n raydata

# Destroy infrastructure
./cleanup.sh

Data Loss

This operation destroys all resources including S3 buckets. Back up important data before running cleanup.

Next Steps

  1. Spark Logs Processing - Production example processing Spark logs with Ray Data and Iceberg
  2. KubeRay Examples - Official Ray on Kubernetes guides
  3. Ray Train - Distributed ML training
  4. Ray Serve - Model serving

Resources