Inference-Ready EKS Cluster
The Inference-Ready EKS Cluster is a pre-configured infrastructure solution designed specifically for AI/ML inference workloads. This solution provides a Kubernetes cluster with all the necessary components to deploy and run inference services using the AI on EKS inference charts or your own deployments and models.
An expanded Readme is available that dives deep into the infrastructure. This page highlights the core components and reasoning for this architecture.
Why?
This Inference-Ready EKS Cluster was created to enable anyone to use EKS for inference, no matter where they are on their journey. LLMs can be challenging to deploy: depending on the model size, architecture, capabilities, and your requirements, different tools are needed to properly deploy, run, and scale the model. Additionally, you may want to run models that aren't LLMs at all, such as text-to-image diffusion models or more traditional machine learning models.
This infrastructure is meant to be the first layer of support. Alongside the Inference Charts and the Guidance you'll also find in this repository, AI on EKS aims to equip you with the tools and knowledge you need to run the inference you want.
Use Cases
- Single Model Testing: If you just want to deploy a model and test what it does or how well it performs.
- Multi-Node Distributed Inference: If the model is too big to fit on a single node.
- Model Autoscaling: If you need to dynamically adjust the number of model replicas running based on incoming requests and queue length.
- Model Benchmarking: If you want to understand how your model will perform on a given dataset with differing traffic patterns. This helps you adjust the model parameters and replicas to meet your Service Level Objectives (SLOs).
- Multi-Model Architectures: If you want to test both LLMs and text-to-image diffusion models.
Introduction
This infrastructure creates an Amazon EKS cluster optimized for AI/ML inference workloads with the following key features:
- KubeRay Operator: Enables distributed Ray workloads for scalable inference
- LeaderWorkerSet: Enables multi-node distributed inference
- AIBrix Stack: Advanced inference optimization and management capabilities
- GPU/Neuron Support: Ready for both NVIDIA GPU and AWS Neuron (Inferentia/Trainium) workloads
- AI/ML Observability Stack: Comprehensive monitoring and observability for ML workloads
- Autoscaling: Karpenter-based node autoscaling for cost optimization
The cluster is specifically designed to work seamlessly with the AI on EKS Inference Charts, providing a complete end-to-end solution for deploying inference workloads.
Resources
This infrastructure deploys the following AWS resources:
Core Infrastructure
- Amazon VPC with public and private subnets across multiple AZs
- Amazon EKS Cluster (v1.33 by default)
- EKS Managed Node Groups: Auto-scaling groups of EC2 instances that serve as EKS compute nodes.
- NAT Gateways for private subnet internet access
- Internet Gateway for public subnet access
- Security Groups with appropriate ingress/egress rules
EKS Add-ons
- AWS Load Balancer Controller for ingress management
- EBS CSI Driver for persistent storage
- VPC CNI for pod networking
- CoreDNS for service discovery
- Kube-proxy for service networking
- Metrics Server for resource metrics
- Amazon CloudWatch Observability for logging and monitoring
AI/ML Specific Components
- KubeRay Operator for distributed Ray workloads
- LeaderWorkerSet for multi-node distributed inference
- NVIDIA Device Plugin for GPU resource management
- AWS Neuron Device Plugin for Inferentia/Trainium support
- Karpenter for intelligent node autoscaling
Observability Stack
- Prometheus for metrics collection
- Grafana for visualization and dashboards
- AlertManager for alerting
- Node Exporter for node-level metrics
- DCGM Exporter for GPU metrics (when GPU nodes are present)
AIBrix Components
- AIBrix Core for inference optimization
- Gateway and routing for traffic management
- Performance monitoring and optimization tools
Deployment
Architecture Diagram
Prerequisites
- AWS CLI configured with appropriate permissions
- Terraform (>= 1.0)
- kubectl for cluster management
- Helm (>= 3.0) for chart deployments
Step 0: Clone and Navigate
git clone https://github.com/awslabs/ai-on-eks.git
cd ai-on-eks/infra/solutions/inference-ready-cluster
Step 1: (Optional) Configure Variables
Edit the terraform/blueprint.tfvars file to customize your deployment:
name = "my-inference-cluster"
region = "us-west-2"
enable_kuberay_operator = true
enable_ai_ml_observability_stack = true
enable_aibrix_stack = true
enable_leader_worker_set = true
availability_zones_count = 4
All variables can be found in the variables.tf file.
Step 2: Deploy Infrastructure
# Run the installation script
./install.sh
The installation script will do the following automatically:
- Copy the base Terraform configuration
- Initialize Terraform
- Plan and apply the infrastructure
- Configure kubectl context
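After the script completes, you can confirm that kubectl is pointing at the new cluster. A minimal check, assuming the cluster name and region configured in Step 1:

```bash
# Confirm the active kubeconfig context and basic cluster access
kubectl config current-context
kubectl get nodes

# If the context was not configured automatically, set it up manually
# (cluster name and region below are examples; use the values from blueprint.tfvars)
aws eks update-kubeconfig --region us-west-2 --name my-inference-cluster
```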
Step 3: VPC
An Amazon Virtual Private Cloud (VPC) is provisioned and configured based on the specified configuration. Following reliability best practices, four Availability Zones (AZs) are configured to maximize the chance of node acquisition and provide high availability. Topology awareness defaults to keeping AI/ML workloads in the same AZ for performance and cost, but is configurable for availability.
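Once nodes are running, you can see how they are distributed across AZs by inspecting the standard topology label (a quick check, not specific to this blueprint):

```bash
# Show each node's Availability Zone
kubectl get nodes -L topology.kubernetes.io/zone
```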
Step 4: EKS
An Amazon Elastic Kubernetes Service (EKS) cluster is provisioned with a Managed Node Group (MNG) that runs critical cluster add-ons (CoreDNS, AWS Load Balancer Controller, and Karpenter) on its compute nodes. Karpenter provides compute capacity for the other EKS add-ons, as well as the inference applications deployed by the user, while prioritizing the most cost-effective instances per the configuration.
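To inspect the capacity Karpenter is managing, you can list its resources. A minimal sketch, assuming the Karpenter v1 API (NodePool, EC2NodeClass, NodeClaim); adjust if your version differs:

```bash
# NodePools and EC2NodeClasses define what Karpenter is allowed to launch
kubectl get nodepools,ec2nodeclasses

# NodeClaims show the instances Karpenter has actually provisioned
kubectl get nodeclaims -o wide
```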
Step 5: EKS add-ons
Other important EKS add-ons (LWS, KubeRay, etc.) are deployed based on the configuration defined in the per-environment Terraform configuration file (see Step 1 above).
Step 6: Observability
An observability stack including Fluent Bit, Prometheus, and Grafana is deployed to collect metrics and logs from the environment. ServiceMonitors and PodMonitors are deployed to watch for AI/ML-related workloads and collect their metrics. Grafana dashboards are configured to automatically visualize the metrics and logs side by side.
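A quick way to confirm the monitoring stack and the AI/ML monitors are in place (a sketch; the exact resources depend on the deployed configuration):

```bash
# Monitoring stack pods
kubectl get pods -n monitoring

# ServiceMonitors and PodMonitors watching AI/ML workloads
kubectl get servicemonitors,podmonitors -A
```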
Step 7: Cluster Ready
Users can now access the EKS API and deploy containerized AI/ML inference workloads with the Kubernetes CLI, using the AI on EKS inference charts or other repositories, and interact with the deployed services through the AWS Network Load Balancer (NLB) endpoint.
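For example, the load balancer endpoint can be retrieved from the ingress controller service shown in the verification output below (assuming the NGINX ingress controller deployed by this blueprint):

```bash
# Retrieve the load balancer hostname for the cluster's ingress
kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```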
Step 8: Verify Deployment
To validate that everything deployed properly, you can run
kubectl get svc,pod,deployment -A
You should see output similar to the following (expand the section to view it):
Details
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
aibrix-system service/aibrix-controller-manager-metrics-service ClusterIP 172.20.218.39 <none> 8080/TCP 13d
aibrix-system service/aibrix-gateway-plugins ClusterIP 172.20.142.245 <none> 50052/TCP 13d
aibrix-system service/aibrix-gpu-optimizer ClusterIP 172.20.14.220 <none> 8080/TCP 13d
aibrix-system service/aibrix-kuberay-operator ClusterIP 172.20.240.255 <none> 8080/TCP 13d
aibrix-system service/aibrix-metadata-service ClusterIP 172.20.252.24 <none> 8090/TCP 13d
aibrix-system service/aibrix-redis-master ClusterIP 172.20.155.43 <none> 6379/TCP 13d
argocd service/argocd-applicationset-controller ClusterIP 172.20.139.94 <none> 7000/TCP 13d
argocd service/argocd-dex-server ClusterIP 172.20.127.60 <none> 5556/TCP,5557/TCP 13d
argocd service/argocd-redis ClusterIP 172.20.48.202 <none> 6379/TCP 13d
argocd service/argocd-repo-server ClusterIP 172.20.232.147 <none> 8081/TCP 13d
argocd service/argocd-server ClusterIP 172.20.233.191 <none> 80/TCP,443/TCP 13d
default service/etcd-client ClusterIP 172.20.47.224 <none> 2379/TCP 12d
default service/etcd-server ClusterIP 172.20.69.95 <none> 2379/TCP,2380/TCP 12d
default service/kubernetes ClusterIP 172.20.0.1 <none> 443/TCP 13d
envoy-gateway-system service/envoy-aibrix-system-aibrix-eg-903790dc ClusterIP 172.20.249.100 <none> 80/TCP 13d
envoy-gateway-system service/envoy-gateway ClusterIP 172.20.113.229 <none> 18000/TCP,18001/TCP,18002/TCP,19001/TCP 13d
ingress-nginx service/ingress-nginx-controller LoadBalancer 172.20.27.209 k8s-ingressn-ingressn-ffa534dcb1-b4b54bcc24eaeddd.elb.us-west-2.amazonaws.com 80:31646/TCP,443:32024/TCP 13d
ingress-nginx service/ingress-nginx-controller-admission ClusterIP 172.20.249.118 <none> 443/TCP 13d
karpenter service/karpenter ClusterIP 172.20.149.70 <none> 8080/TCP 13d
kube-system service/aws-load-balancer-webhook-service ClusterIP 172.20.83.104 <none> 443/TCP 13d
kube-system service/eks-extension-metrics-api ClusterIP 172.20.87.142 <none> 443/TCP 13d
kube-system service/k8s-neuron-scheduler ClusterIP 172.20.248.128 <none> 12345/TCP 13d
kube-system service/kube-dns ClusterIP 172.20.0.10 <none> 53/UDP,53/TCP,9153/TCP 13d
kube-system service/kube-prometheus-stack-kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 13d
kuberay-operator service/kuberay-operator ClusterIP 172.20.117.159 <none> 8080/TCP 13d
lws-system service/lws-controller-manager-metrics-service ClusterIP 172.20.17.186 <none> 8443/TCP 13d
lws-system service/lws-webhook-service ClusterIP 172.20.173.201 <none> 443/TCP 13d
monitoring service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 13d
monitoring service/dcgm-exporter ClusterIP 172.20.79.5 <none> 9400/TCP 13d
monitoring service/fluent-bit ClusterIP 172.20.111.213 <none> 2020/TCP 13d
monitoring service/kube-prometheus-stack-alertmanager ClusterIP 172.20.45.163 <none> 9093/TCP,8080/TCP 13d
monitoring service/kube-prometheus-stack-coredns ClusterIP None <none> 9153/TCP 13d
monitoring service/kube-prometheus-stack-grafana ClusterIP 172.20.251.144 <none> 80/TCP 13d
monitoring service/kube-prometheus-stack-kube-controller-manager ClusterIP None <none> 10257/TCP 13d
monitoring service/kube-prometheus-stack-kube-etcd ClusterIP None <none> 2381/TCP 13d
monitoring service/kube-prometheus-stack-kube-proxy ClusterIP None <none> 10249/TCP 13d
monitoring service/kube-prometheus-stack-kube-scheduler ClusterIP None <none> 10259/TCP 13d
monitoring service/kube-prometheus-stack-kube-state-metrics ClusterIP 172.20.81.57 <none> 8080/TCP 13d
monitoring service/kube-prometheus-stack-operator ClusterIP 172.20.163.90 <none> 443/TCP 13d
monitoring service/kube-prometheus-stack-prometheus ClusterIP 172.20.1.251 <none> 9090/TCP,8080/TCP 13d
monitoring service/kube-prometheus-stack-prometheus-node-exporter ClusterIP 172.20.88.160 <none> 9100/TCP 13d
monitoring service/my-cluster ClusterIP 172.20.54.44 <none> 9200/TCP,9300/TCP,9600/TCP,9650/TCP 13d
monitoring service/my-cluster-dashboards ClusterIP 172.20.161.35 <none> 5601/TCP 13d
monitoring service/my-cluster-masters ClusterIP None <none> 9200/TCP,9300/TCP 13d
monitoring service/opencost ClusterIP 172.20.162.78 <none> 9003/TCP,9090/TCP 13d
monitoring service/opensearch-discovery ClusterIP None <none> 9300/TCP 13d
monitoring service/opensearch-operator-controller-manager-metrics-service ClusterIP 172.20.183.236 <none> 8443/TCP 13d
monitoring service/prometheus-operated ClusterIP None <none> 9090/TCP 13d
NAMESPACE NAME READY STATUS RESTARTS AGE
aibrix-system pod/aibrix-controller-manager-5948f8f8b7-qjm7z 1/1 Running 0 13d
aibrix-system pod/aibrix-gateway-plugins-5978d98445-qj2jw 1/1 Running 0 13d
aibrix-system pod/aibrix-gpu-optimizer-64c978ddd8-bw7hk 1/1 Running 0 13d
aibrix-system pod/aibrix-kuberay-operator-8b65d7cc4-xrcm6 1/1 Running 0 13d
aibrix-system pod/aibrix-metadata-service-5499dc64b7-69tzc 1/1 Running 0 13d
aibrix-system pod/aibrix-redis-master-576767646c-w9lhl 1/1 Running 0 13d
argocd pod/argocd-application-controller-0 1/1 Running 0 13d
argocd pod/argocd-applicationset-controller-6847f76cb8-svwvt 1/1 Running 0 13d
argocd pod/argocd-dex-server-f6d74975f-g5rj8 1/1 Running 0 13d
argocd pod/argocd-notifications-controller-86f4bb887d-sgxlb 1/1 Running 0 13d
argocd pod/argocd-redis-588f9bcd4d-9tncd 1/1 Running 0 13d
argocd pod/argocd-repo-server-5cbcc778f4-kd4ll 1/1 Running 0 13d
argocd pod/argocd-server-7c9898bc58-vfqwn 1/1 Running 0 13d
envoy-gateway-system pod/envoy-aibrix-system-aibrix-eg-903790dc-567ff75b87-22ctt 2/2 Running 0 13d
envoy-gateway-system pod/envoy-gateway-6d7859b6bf-6hhf5 1/1 Running 0 13d
ingress-nginx pod/ingress-nginx-controller-58f4c5584-wt6rk 1/1 Running 0 13d
karpenter pod/karpenter-849fd44788-4fgml 1/1 Running 0 13d
karpenter pod/karpenter-849fd44788-zbm9z 1/1 Running 0 13d
kube-system pod/aws-load-balancer-controller-c495bf799-crnlh 1/1 Running 0 13d
kube-system pod/aws-load-balancer-controller-c495bf799-nwkqv 1/1 Running 0 13d
kube-system pod/aws-node-6ff6l 2/2 Running 0 8d
kube-system pod/aws-node-728vt 2/2 Running 0 2d16h
kube-system pod/aws-node-87jfl 2/2 Running 0 13d
kube-system pod/aws-node-wtnlj 2/2 Running 0 13d
kube-system pod/aws-node-zzc4g 2/2 Running 0 2d16h
kube-system pod/coredns-7bf648ff5d-98bp4 1/1 Running 0 13d
kube-system pod/coredns-7bf648ff5d-w56nm 1/1 Running 0 13d
kube-system pod/ebs-csi-controller-5bdc7bfdb6-79658 6/6 Running 0 13d
kube-system pod/ebs-csi-controller-5bdc7bfdb6-958zf 6/6 Running 0 13d
kube-system pod/ebs-csi-node-4z2mb 3/3 Running 0 13d
kube-system pod/ebs-csi-node-8qq2s 3/3 Running 0 2d16h
kube-system pod/ebs-csi-node-q9h5r 3/3 Running 0 2d16h
kube-system pod/ebs-csi-node-t77j9 3/3 Running 0 13d
kube-system pod/ebs-csi-node-w9mh8 3/3 Running 0 8d
kube-system pod/eks-pod-identity-agent-jjfz4 1/1 Running 0 13d
kube-system pod/eks-pod-identity-agent-jthdk 1/1 Running 0 2d16h
kube-system pod/eks-pod-identity-agent-ng556 1/1 Running 0 8d
kube-system pod/eks-pod-identity-agent-q6ths 1/1 Running 0 2d16h
kube-system pod/eks-pod-identity-agent-rwkr9 1/1 Running 0 13d
kube-system pod/k8s-neuron-scheduler-56f6c8bd67-hbzgz 1/1 Running 0 13d
kube-system pod/kube-proxy-4wf7s 1/1 Running 0 2d16h
kube-system pod/kube-proxy-7dm2x 1/1 Running 0 2d16h
kube-system pod/kube-proxy-9d9cm 1/1 Running 0 8d
kube-system pod/kube-proxy-lt4sp 1/1 Running 0 13d
kube-system pod/kube-proxy-nklwj 1/1 Running 0 13d
kube-system pod/my-scheduler-6959876cb4-gprm5 1/1 Running 0 13d
kuberay-operator pod/kuberay-operator-6d988d7dd9-ncx4h 1/1 Running 0 13d
lws-system pod/lws-controller-manager-cbb85458b-7cvhr 1/1 Running 0 13d
lws-system pod/lws-controller-manager-cbb85458b-dqj8g 1/1 Running 0 13d
monitoring pod/alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 13d
monitoring pod/fluent-bit-52m29 1/1 Running 0 2d16h
monitoring pod/fluent-bit-hb824 1/1 Running 0 13d
monitoring pod/fluent-bit-hsptw 1/1 Running 0 13d
monitoring pod/fluent-bit-nkmrq 1/1 Running 0 8d
monitoring pod/fluent-bit-qdsm2 1/1 Running 0 2d16h
monitoring pod/fluent-operator-7f75b8ccf4-z5924 1/1 Running 0 13d
monitoring pod/kube-prometheus-stack-grafana-c64f79c4f-zqlm7 3/3 Running 0 13d
monitoring pod/kube-prometheus-stack-kube-state-metrics-77976dc6c4-fff28 1/1 Running 0 13d
monitoring pod/kube-prometheus-stack-operator-6655669d75-4kh9s 1/1 Running 0 13d
monitoring pod/kube-prometheus-stack-prometheus-node-exporter-7xbrs 1/1 Running 0 13d
monitoring pod/kube-prometheus-stack-prometheus-node-exporter-gwkb2 1/1 Running 0 13d
monitoring pod/kube-prometheus-stack-prometheus-node-exporter-k6zl7 1/1 Running 0 2d16h
monitoring pod/kube-prometheus-stack-prometheus-node-exporter-pl6m7 1/1 Running 0 2d16h
monitoring pod/kube-prometheus-stack-prometheus-node-exporter-st9kp 1/1 Running 0 8d
monitoring pod/opencost-bd64bfbf5-jbfvr 2/2 Running 0 13d
monitoring pod/opensearch-dashboards-84675f8b9-6jd2h 1/1 Running 0 13d
monitoring pod/opensearch-dashboards-84675f8b9-9mcs2 1/1 Running 0 13d
monitoring pod/opensearch-masters-0 1/1 Running 0 13d
monitoring pod/opensearch-masters-1 1/1 Running 0 8d
monitoring pod/opensearch-masters-2 1/1 Running 0 8d
monitoring pod/opensearch-operator-controller-manager-58b76955b9-w46gl 2/2 Running 0 13d
monitoring pod/opensearch-securityconfig-update-4fdcz 0/1 Completed 0 13d
monitoring pod/prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 0 13d
nvidia-device-plugin pod/nvidia-device-plugin-node-feature-discovery-master-77b96ddp8h25 1/1 Running 0 13d
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
aibrix-system deployment.apps/aibrix-controller-manager 1/1 1 1 13d
aibrix-system deployment.apps/aibrix-gateway-plugins 1/1 1 1 13d
aibrix-system deployment.apps/aibrix-gpu-optimizer 1/1 1 1 13d
aibrix-system deployment.apps/aibrix-kuberay-operator 1/1 1 1 13d
aibrix-system deployment.apps/aibrix-metadata-service 1/1 1 1 13d
aibrix-system deployment.apps/aibrix-redis-master 1/1 1 1 13d
argocd deployment.apps/argocd-applicationset-controller 1/1 1 1 13d
argocd deployment.apps/argocd-dex-server 1/1 1 1 13d
argocd deployment.apps/argocd-notifications-controller 1/1 1 1 13d
argocd deployment.apps/argocd-redis 1/1 1 1 13d
argocd deployment.apps/argocd-repo-server 1/1 1 1 13d
argocd deployment.apps/argocd-server 1/1 1 1 13d
envoy-gateway-system deployment.apps/envoy-aibrix-system-aibrix-eg-903790dc 1/1 1 1 13d
envoy-gateway-system deployment.apps/envoy-gateway 1/1 1 1 13d
ingress-nginx deployment.apps/ingress-nginx-controller 1/1 1 1 13d
karpenter deployment.apps/karpenter 2/2 2 2 13d
kube-system deployment.apps/aws-load-balancer-controller 2/2 2 2 13d
kube-system deployment.apps/coredns 2/2 2 2 13d
kube-system deployment.apps/ebs-csi-controller 2/2 2 2 13d
kube-system deployment.apps/k8s-neuron-scheduler 1/1 1 1 13d
kube-system deployment.apps/my-scheduler 1/1 1 1 13d
kuberay-operator deployment.apps/kuberay-operator 1/1 1 1 13d
lws-system deployment.apps/lws-controller-manager 2/2 2 2 13d
monitoring deployment.apps/fluent-operator 1/1 1 1 13d
monitoring deployment.apps/kube-prometheus-stack-grafana 1/1 1 1 13d
monitoring deployment.apps/kube-prometheus-stack-kube-state-metrics 1/1 1 1 13d
monitoring deployment.apps/kube-prometheus-stack-operator 1/1 1 1 13d
monitoring deployment.apps/opencost 1/1 1 1 13d
monitoring deployment.apps/opensearch-dashboards 2/2 2 2 13d
monitoring deployment.apps/opensearch-operator-controller-manager 1/1 1 1 13d
nvidia-device-plugin deployment.apps/nvidia-device-plugin-node-feature-discovery-master 1/1 1 1 13d
Inference on EKS
EKS is a powerful platform for running AI/ML inference. For a deep dive on many of the inference possibilities on EKS, please check the inference section.
Inference Charts Integration
This infrastructure is specifically designed to work with the AI on EKS Inference Charts. The cluster provides all the necessary components and configurations for seamless deployment of inference workloads.
Prerequisites for Inference Charts
The infrastructure automatically provides:
- KubeRay Operator - Required for Ray-vLLM deployments
- GPU/Neuron Device Plugins - For hardware resource management
- Observability Stack - Prometheus and Grafana for monitoring
- AIBrix Integration - For inference optimization and management
Supported Inference Patterns
The cluster supports all inference patterns provided by the inference charts:
vLLM Deployments
- Direct vLLM deployment using Kubernetes Deployment
- Suitable for single-node inference workloads
- Supports both GPU and Neuron accelerators
Ray-vLLM Deployments
- Distributed vLLM on Ray Serve
- Automatic scaling based on workload demand
- Advanced observability with Prometheus/Grafana integration
- Topology-aware scheduling for optimal performance
AIBrix Deployments
- AIBrix backed LLM deployments
- Efficient LLM routing for multiple replicas
- Supports mixed GPU and Neuron accelerators
Example Deployments
Once your cluster is ready, you can deploy inference workloads:
# Navigate to inference charts
cd ../../../blueprints/inference/inference-charts
# Create Hugging Face token secret
kubectl create secret generic hf-token --from-literal=token=your_hf_token
# Deploy GPU Ray-vLLM with Llama model
helm install llama-inference . \
--values values-llama-32-1b-ray-vllm.yaml
# Deploy Neuron vLLM with optimized model
helm install neuron-inference . \
--values values-llama-31-8b-vllm-neuron.yaml
Please check the inference charts section for a deeper look at what is available.
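Once a release is up, you can smoke-test it through a port-forward. This is a sketch: the service name and port below are assumptions, so check `kubectl get svc` for the values your release actually created. vLLM-based deployments expose an OpenAI-compatible API:

```bash
# Find the service created by your release
kubectl get svc

# Port-forward to it (service name and port are placeholders)
kubectl port-forward svc/llama-inference 8000:8000 &

# List the models served by the OpenAI-compatible endpoint
curl http://localhost:8000/v1/models
```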
Observability Integration
The infrastructure provides comprehensive observability for inference workloads:
- Prometheus Metrics: Automatic collection of inference metrics
- Grafana Dashboards: Pre-configured dashboards for Ray and vLLM
- Log Aggregation: Centralized logging with Fluent Bit
- GPU/Neuron Monitoring: Hardware utilization metrics
Access Grafana dashboard:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
Please see the observability section for an in-depth look at using these features.
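To log in to Grafana, you will typically need the admin password. With the kube-prometheus-stack Helm chart it is usually stored in a Kubernetes secret; the secret name and key below are assumptions, so verify them in the monitoring namespace:

```bash
# Retrieve the Grafana admin password (secret name/key may differ in your deployment)
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
```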
Cost Optimization
The cluster includes several cost optimization features:
- Karpenter Autoscaling: Automatic node provisioning and deprovisioning
- Spot Instance Support: Configure node groups to use spot instances
- Topology Awareness: Efficient resource utilization across AZs
- Resource Limits: Proper resource requests and limits for workloads
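To see what Karpenter has actually provisioned, and whether Spot capacity is being used, you can inspect the node labels it applies; a quick check, assuming standard Karpenter and Kubernetes labels:

```bash
# Capacity type, instance type, and zone for each node
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type,topology.kubernetes.io/zone
```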
Troubleshooting
This section covers common issues you may encounter when deploying and operating the inference-ready EKS cluster, along with detailed solutions and diagnostic steps.
Deployment Issues
1. Terraform Apply Failures
Symptoms:
- Terraform fails during terraform apply with resource creation errors
- Module-specific failures during sequential deployment
Common Causes & Solutions:
Insufficient AWS Permissions:
# Verify your AWS credentials and permissions
aws sts get-caller-identity
aws iam get-user
# Required permissions include:
# - EKS cluster creation and management
# - EC2 instance management
# - VPC and networking resources
# - IAM role creation and attachment
# - KMS key management
Service Quota Limits:
# Check EC2 service quotas
aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A # Running On-Demand instances
aws service-quotas get-service-quota --service-code ec2 --quota-code L-34B43A08 # Running On-Demand G instances
aws service-quotas get-service-quota --service-code ec2 --quota-code L-6E869C2A # Running On-Demand Inf instances
# Request quota increases if needed
aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-34B43A08 --desired-value 32
Region Availability:
# Verify instance types are available in your region
aws ec2 describe-instance-type-offerings --location-type availability-zone --filters Name=instance-type,Values=g5.xlarge,inf2.xlarge
2. EKS Cluster Creation Issues
Symptoms:
- EKS cluster fails to create or becomes stuck in "CREATING" state
- Node groups fail to join the cluster
Diagnostic Steps:
# Check cluster status
aws eks describe-cluster --name inference-cluster --region us-west-2
Common Solutions:
- Ensure VPC has sufficient IP addresses across 4 availability zones
- Verify NAT Gateway creation in public subnets
- Check security group configurations allow required EKS communication
Node and Pod Issues
3. Pods Stuck in Pending State
Symptoms:
- Inference workloads remain in "Pending" status
- Karpenter not provisioning nodes
Diagnostic Commands:
# Check pod events and resource requests
kubectl describe pod <pod-name> -n <namespace>
# Check Karpenter logs
kubectl logs -n karpenter deployment/karpenter
# Check available nodes and their capacity
kubectl get nodes -o wide
kubectl describe nodes
Common Causes & Solutions:
Insufficient GPU/Neuron quotas or restrictive NodePool configuration:
# Verify Karpenter NodePool configurations
kubectl get nodepool -o yaml
4. GPU Detection and Device Plugin Issues
Symptoms:
- GPU nodes show 0 allocatable GPUs
- NVIDIA device plugin not running
Diagnostic Steps:
# Verify GPU visibility on nodes
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpus: .status.allocatable["nvidia.com/gpu"]}'
# Check node labels
kubectl get nodes --show-labels | grep gpu
Solutions:
# Restart NVIDIA device plugin if needed
kubectl delete pods -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin
5. AWS Neuron Setup Issues
Symptoms:
- Neuron devices not detected on inf2/trn1 instances
- Neuron device plugin failing
Diagnostic Commands:
# Check Neuron device plugin
kubectl get pods -n kube-system | grep neuron
# Verify Neuron devices
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, neuron: .status.allocatable["aws.amazon.com/neuron"]}'
# Check Neuron scheduler
kubectl get pods -n kube-system | grep my-scheduler
Solutions:
# Verify Neuron runtime installation
kubectl describe node <inf2-node> | grep neuron
# Check Neuron device plugin logs
kubectl logs -n kube-system <neuron-device-plugin-pod>
Model Deployment Issues
6. Model Download Failures
Symptoms:
- Pods fail to start with image pull or model download errors
- Hugging Face authentication failures
Diagnostic Steps:
# Check pod logs for download errors
kubectl logs <pod-name> -n <namespace>
# Verify Hugging Face token secret
kubectl get secret hf-token -o yaml
kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 -d
Solutions:
# Recreate Hugging Face token secret
kubectl delete secret hf-token
kubectl create secret generic hf-token --from-literal=token=<your-hf-token>
# Check internet connectivity from pods
kubectl run test-pod --image=curlimages/curl -it --rm -- curl -I https://huggingface.co
7. Out of Memory (OOM) Issues
Symptoms:
- Pods getting killed with OOMKilled status
- Models failing to load completely
Diagnostic Commands:
# Check pod resource usage
kubectl top pods -n <namespace>
# Check node memory usage
kubectl top nodes
# Review pod events for OOM kills
kubectl get events --field-selector reason=OOMKilling
Solutions:
# Move to a larger instance type (more GPU and host memory) or shard the model across devices
# For vLLM-based deployments, serving parameters (e.g. --gpu-memory-utilization, --max-model-len) can also be tuned
Networking and Load Balancer Issues
8. Service Connectivity Problems
Symptoms:
- Cannot access inference endpoints
- Load balancer not provisioning
Diagnostic Steps:
# Check service status
kubectl get svc -A
# Check AWS Load Balancer Controller
kubectl logs -n kube-system deployment/aws-load-balancer-controller
# Verify security groups and NACLs
aws ec2 describe-security-groups --filters "Name=group-name,Values=*inference-cluster*"
Solutions:
# Restart AWS Load Balancer Controller
kubectl rollout restart deployment/aws-load-balancer-controller -n kube-system
# Check ingress annotations and configurations
kubectl describe ingress <ingress-name>
Monitoring and Observability Issues
9. Prometheus/Grafana Not Working
Symptoms:
- Monitoring dashboards not accessible
- Metrics not being collected
Diagnostic Commands:
# Check monitoring stack pods
kubectl get pods -n monitoring
# Check Prometheus targets
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Navigate to http://localhost:9090/targets
# Check Grafana access
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Navigate to http://localhost:3000
Solutions:
# Restart monitoring components
kubectl rollout restart deployment/kube-prometheus-stack-grafana -n monitoring
# Check service monitors
kubectl get servicemonitor -A
Performance and Scaling Issues
10. Slow Model Inference
Symptoms:
- High latency in model responses
- Poor throughput performance
Diagnostic Steps:
# Check resource utilization
kubectl top pods -n <namespace> --containers
# Monitor GPU utilization (if using GPUs)
kubectl exec -it <pod-name> -- nvidia-smi
Solutions:
- Verify model is using appropriate hardware acceleration
- Check if multiple models are competing for resources
- Optimize model parameters for latency
- Scale out model replicas and use load balancing to distribute requests
General Debugging Commands
# Get cluster information
kubectl cluster-info
kubectl get nodes -o wide
# Check all system pods
kubectl get pods -A | grep -v Running
# View recent events
kubectl get events --sort-by='.lastTimestamp' -A
# Check Karpenter provisioning
kubectl logs -n karpenter deployment/karpenter --tail=100
# Verify EKS add-ons
aws eks describe-addon --cluster-name inference-cluster --addon-name vpc-cni
Getting Additional Help
If you continue to experience issues:
- Check AWS Service Health: Visit the AWS Service Health Dashboard
- Review CloudWatch Logs: Check EKS control plane logs in CloudWatch
- Consult Documentation: Refer to the EKS Troubleshooting Guide
- Community Support: Post questions in the AI on EKS GitHub Issues