Deploying ML models on EKS requires access to GPU or AWS Neuron instances. If your deployment isn't working, the cause is often missing access to these instance types. Some deployment patterns also rely on Karpenter autoscaling and static node groups; if nodes aren't initializing, check the Karpenter or node group logs to resolve the issue.
NVIDIA Dynamo is a cloud-native platform for deploying and managing AI inference graphs at scale. This implementation provides complete infrastructure setup with enterprise-grade monitoring and scalability on Amazon EKS.
NVIDIA Dynamo on Amazon EKS
This NVIDIA Dynamo blueprint is currently in active development. We are continuously improving the user experience and functionality. Features, configurations, and deployment processes may change between releases as we iterate and enhance the implementation based on user feedback and best practices.
Please expect iterative improvements in upcoming releases. If you encounter any issues or have suggestions for improvements, please feel free to open an issue or contribute to the project.
Quick Start
Want to get started immediately? Here's the minimal command sequence:
# 1. Clone and navigate
git clone https://github.com/awslabs/ai-on-eks.git && cd ai-on-eks/infra/nvidia-dynamo
# 2. Deploy everything (15-30 minutes)
./install.sh
# 3. Build base image and deploy inference
cd ../../blueprints/inference/nvidia-dynamo
source dynamo_env.sh
./build-base-image.sh vllm --push
./deploy.sh
# 4. Test your deployment
./test.sh
Prerequisites: AWS CLI, kubectl, helm, docker, terraform, earthly, Python 3.10+, git (detailed setup below)
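If kubectl cannot reach the new cluster after install.sh finishes, update your kubeconfig first. This is a minimal sketch; the region and cluster name below are placeholders for your deployment:
# Point kubectl at the newly created cluster (adjust region and cluster name)
aws eks update-kubeconfig --region <your-region> --name <your-cluster-name>
# Confirm the nodes are visible
kubectl get nodes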
What is NVIDIA Dynamo?
NVIDIA Dynamo is an open-source inference framework designed to optimize performance and scalability for large language models (LLMs) and generative AI applications.
What is an Inference Graph?
An inference graph is a computational workflow that defines how AI models process data through interconnected nodes, enabling complex multi-step AI operations like:
- LLM chains: Sequential processing through multiple language models
- Multimodal processing: Combining text, image, and audio processing
- Custom inference pipelines: Tailored workflows for specific AI applications
- Disaggregated serving: Separating prefill and decode phases for optimal resource utilization
Overview
This blueprint uses the official NVIDIA Dynamo Helm charts from the dynamo source repository, with additional shell scripts and Terraform automation to simplify the deployment process on Amazon EKS.
Deployment Approach
Why This Setup Process? While this implementation involves multiple steps, it provides several advantages over a simple Helm-only deployment:
- Complete Infrastructure: Automatically provisions VPC, EKS cluster, ECR repositories, and monitoring stack
- Production Ready: Includes enterprise-grade security, monitoring, and scalability features
- AWS Integration: Leverages EKS autoscaling, EFA networking, and AWS services
- Customizable: Allows fine-tuning of GPU node pools, networking, and resource allocation
- Reproducible: Infrastructure as Code ensures consistent deployments across environments
For Simpler Deployments: If you already have an EKS cluster and prefer a minimal setup, you can use the Dynamo Helm charts directly from the source repository. This blueprint provides the full production-ready experience.
As LLMs and generative AI applications become increasingly prevalent, the demand for efficient, scalable, and low-latency inference solutions has grown. Traditional inference systems often struggle to meet these demands, especially in distributed, multi-node environments. NVIDIA Dynamo addresses these challenges by offering innovative solutions to optimize performance and scalability with support for AWS services such as Amazon S3, Elastic Fabric Adapter (EFA), and Amazon EKS.
Key Features
Performance Optimizations:
- Disaggregated Serving: Separates prefill and decode phases across different GPUs for optimal resource utilization
- Dynamic GPU Scheduling: Intelligent resource allocation based on real-time demand through the NVIDIA Dynamo Planner
- Smart Request Routing: Minimizes KV cache recomputation by routing requests to workers with relevant cached data
- Accelerated Data Transfer: Low-latency communication via NVIDIA NIXL library
- Efficient KV Cache Management: Intelligent offloading across memory hierarchies with the KV Cache Block Manager
Infrastructure Ready:
- Inference Engine Agnostic: Supports TensorRT-LLM, vLLM, SGLang, and other runtimes
- Modular Design: Pick and choose components that fit your existing AI stack
- Enterprise Grade: Complete monitoring, logging, and security integration
- Amazon EKS Optimized: Leverages EKS autoscaling, GPU support, and AWS services
Architecture
The deployment uses Amazon EKS with the following components:
Key Components:
- VPC and Networking: Standard VPC with EFA support for low-latency inter-node communication
- EKS Cluster: Managed Kubernetes with GPU-enabled node groups using Karpenter
- Dynamo Platform: Operator, API Store, and supporting services (NATS, PostgreSQL, MinIO)
- Monitoring Stack: Prometheus, Grafana, and AI/ML observability
- Storage: Amazon EFS for shared model storage and caching
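Once the infrastructure is up, you can confirm these components directly with kubectl. This is a quick sketch; namespace names follow the defaults used elsewhere in this guide:
# Dynamo platform components (operator, API store, NATS, PostgreSQL, MinIO)
kubectl get pods -n dynamo-cloud
# Monitoring stack
kubectl get pods -n kube-prometheus-stack
# Nodes provisioned for the cluster, with instance types
kubectl get nodes -L node.kubernetes.io/instance-type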
Prerequisites
System Requirements: Ubuntu 22.04 or 24.04 (NVIDIA Dynamo officially supports only these versions)
Install the following tools on your setup host (recommended: EC2 instance t3.xlarge or higher with EKS and ECR permissions):
- AWS CLI: Configured with appropriate permissions (installation guide)
- kubectl: Kubernetes command-line tool (installation guide)
- helm: Kubernetes package manager (installation guide)
- terraform: Infrastructure as code tool (installation guide)
- docker: With buildx; your user needs Docker permissions (installation guide)
- earthly: Multi-platform build automation tool used by NVIDIA Dynamo for reproducible container builds (installation guide)
- Python 3.10+: With pip and venv (installation guide)
- git: Version control (installation guide)
- EKS Cluster: Version 1.33 (tested and supported)
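A quick way to confirm the tools above are installed and on your PATH before running the installer:
aws --version
kubectl version --client
helm version
terraform version
docker buildx version
earthly --version
python3 --version
git --version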
Deploying the Solution
Test and Validate
Automated Testing
Use the built-in test script to validate your deployment:
./test.sh
This script:
- Starts port forwarding to the frontend service
- Tests the health check, metrics, and /v1/models endpoints
- Runs sample inference requests to verify functionality
Manual Testing
Access your deployment directly:
kubectl port-forward svc/<frontend-service> 3000:3000 -n dynamo-cloud &
curl -X POST http://localhost:3000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [
{"role": "user", "content": "Explain what a Q-Bit is in quantum computing."}
],
"max_tokens": 2000,
"temperature": 0.7,
"stream": false
}'
Expected Output:
{
"id": "1918b11a-6d98-4891-bc84-08f99de70fd0",
"choices": [
{
"index": 0,
"message": {
"content": "A Q-bit, or qubit, is the basic unit of quantum information...",
"role": "assistant"
},
"finish_reason": "stop"
}
],
"created": 1752018267,
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"object": "chat.completion"
}
Monitor and Observe
Grafana Dashboard
Access Grafana for visualization (forwarded to local port 3000; if the frontend port-forward from the testing step is still using 3000, stop it first or choose a different local port):
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80
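Grafana credentials are managed by the kube-prometheus-stack Helm chart. In a default installation the admin password can usually be read from the chart's secret (the secret name may differ if it was customized):
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo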
Prometheus Metrics
Access Prometheus for metrics collection (port 9090):
kubectl port-forward -n kube-prometheus-stack svc/prometheus 9090:80
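With the port forward active, you can sanity-check that Prometheus is reachable and scraping targets through its HTTP API:
# List scrape targets (requires the port forward above to be running)
curl -s http://localhost:9090/api/v1/targets | head -c 500
# Simple instant query to confirm metrics are flowing
curl -s 'http://localhost:9090/api/v1/query?query=up' | head -c 500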
Automatic Monitoring
The deployment automatically creates:
- Service: Exposes inference graphs for API calls and metrics
- ServiceMonitor: Configures Prometheus to scrape metrics
- Dashboards: Pre-configured Grafana dashboards for inference monitoring
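To verify these objects were created for your inference graph (a sketch; object names depend on the deployed graph):
kubectl get svc -n dynamo-cloud
kubectl get servicemonitors -n dynamo-cloud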
Advanced Configuration
ECR Authentication
The deployment uses IRSA (IAM Roles for Service Accounts) for secure ECR access:
- Primary Method: IRSA eliminates the need for credential rotation
- Fallback Method: ECR token refresh CronJob (legacy mode)
- Security: Service accounts automatically authenticate to ECR
- No Secrets: No long-lived credentials stored in Kubernetes
Note: Amazon EKS Pod Identity is now generally available as an alternative to IRSA, but IRSA remains the recommended approach here due to its maturity, wide adoption, and proven reliability in production environments.
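You can confirm IRSA is wired up by checking for the standard EKS role annotation on the service accounts in the Dynamo namespace (a sketch; the exact service account names depend on the chart):
kubectl get sa -n dynamo-cloud -o yaml | grep "eks.amazonaws.com/role-arn"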
Custom Model Deployment
To deploy custom models, modify the configuration files in dynamo/examples/llm/configs/:
- Choose Architecture: Select based on model size and requirements
- Update Configuration: Edit the appropriate YAML file
- Set Model Parameters: Update the model and served_model_name fields
- Configure Resources: Adjust GPU allocation and memory settings
Example for DeepSeek-R1 70B model:
Common:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  max-model-len: 32768
  tensor-parallel-size: 4

Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

VllmWorker:
  ServiceArgs:
    resources:
      gpu: '4'
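After editing the configuration, redeploy from the blueprint directory using the same scripts as the quick start (a sketch; this assumes deploy.sh picks up the updated configuration):
# From the repository root
cd blueprints/inference/nvidia-dynamo
source dynamo_env.sh
./deploy.sh
./test.sh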
Karpenter Node Pools
The deployment can optionally use custom Karpenter node pools optimized for NVIDIA Dynamo:
- C7i CPU Pools: For general compute and BuildKit (newer than base M5 instances)
- G6 GPU Pools: For inference workloads with NVIDIA L4 GPUs
- Higher Priority: Weight 100 vs base addons weight 50 for priority scheduling
- BuildKit Support: User namespace configuration for container builds
- EFA Support: Low-latency networking for multi-node setups
Note: Custom node pools are disabled by default. The base infrastructure provides existing Karpenter node pools (G6 GPU, G5 GPU, M5 CPU) that work well for most Dynamo workloads. Enable custom pools only if you need BuildKit support or higher scheduling priority.
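To see which Karpenter node pools exist and which pool provisioned each node (a sketch assuming a recent Karpenter release with the NodePool and NodeClaim APIs):
kubectl get nodepools
kubectl get nodeclaims
kubectl get nodes -L karpenter.sh/nodepool -L node.kubernetes.io/instance-type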
Configuration Options
Modify terraform/blueprint.tfvars for customization:
# Enable custom node pools (optional - disabled by default)
enable_custom_karpenter_nodepools = true
# Choose AMI (AL2023 recommended)
use_bottlerocket = false
# Resource limits
karpenter_cpu_limits = 10000
karpenter_memory_limits = 10000
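If you apply configuration changes manually rather than rerunning install.sh, a generic Terraform workflow from the terraform directory looks like the sketch below. Note that install.sh may already wrap these steps for you:
cd terraform
terraform plan -var-file=blueprint.tfvars
terraform apply -var-file=blueprint.tfvars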
Troubleshooting
Common Issues
- GPU Nodes Not Available: Check Karpenter logs and instance availability
- Image Pull Errors: Verify ECR repositories and image push success
- Pod Failures: Check resource limits and cluster capacity
- Deployment Timeouts: Ensure base images are built and available
Debug Commands
# Check cluster status
kubectl get nodes
kubectl get pods -n dynamo-cloud
# View logs
kubectl logs -n dynamo-cloud deployment/dynamo-operator
kubectl logs -n dynamo-cloud deployment/dynamo-api-store
# Check deployments
source dynamo_venv/bin/activate
dynamo deployment list --endpoint "$DYNAMO_CLOUD"
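For node provisioning issues, the Karpenter controller logs and recent cluster events are usually the fastest signal. The Karpenter namespace below is an assumption; it may be karpenter or kube-system depending on how it was installed:
# Karpenter controller logs
kubectl logs -n karpenter deployment/karpenter --tail=100
# Recent events in the Dynamo namespace
kubectl get events -n dynamo-cloud --sort-by=.lastTimestamp | tail -n 20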
Performance Optimization
- Model Size Guidelines: Use appropriate architecture for model parameters
- Resource Allocation: Match GPU count to model requirements
- Network Configuration: Ensure EFA is enabled for multi-node setups
- Storage: Use EFS for shared model storage and caching
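To confirm EFA devices are exposed on the GPU nodes (the vpc.amazonaws.com/efa extended resource only appears if the EFA device plugin is running):
kubectl describe nodes | grep "vpc.amazonaws.com/efa"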
Alternative Deployment Options
For Existing EKS Clusters
If you already have an EKS cluster with GPU nodes and prefer a simpler approach:
- Direct Helm Installation: Use the official NVIDIA Dynamo Helm charts directly from the dynamo source repository
- Manual Setup: Follow the upstream NVIDIA Dynamo documentation for Kubernetes deployment
- Custom Integration: Integrate Dynamo components into your existing infrastructure
Why Use This Blueprint?
This blueprint is designed for users who want:
- Complete Infrastructure: End-to-end setup from VPC to running inference
- Production Readiness: Enterprise-grade monitoring, security, and scalability
- AWS Integration: Optimized for EKS, ECR, EFA, and other AWS services
- Best Practices: Follows ai-on-eks patterns and AWS recommendations
Repository Information
- Repository: awslabs/ai-on-eks
- Documentation: Complete NVIDIA Dynamo Blueprint
- NVIDIA Dynamo: Official Documentation
- Dynamo Source: NVIDIA Dynamo Repository
Next Steps
- Explore Examples: Check the examples folder in the GitHub repository
- Scale Deployments: Configure multi-node setups for larger models
- Integrate Applications: Connect your applications to the inference endpoints
- Monitor Performance: Use Grafana dashboards for ongoing monitoring
- Optimize Costs: Implement auto-scaling and resource optimization
Clean Up
When you're finished with your NVIDIA Dynamo deployment, remove all resources using the cleanup script:
cd infra/nvidia-dynamo
./cleanup.sh
This safely destroys the NVIDIA Dynamo deployments and infrastructure components in the correct order, including:
- Dynamo platform components and workloads
- Kubernetes resources and namespaces
- ECR repositories and container images
- Terraform-managed infrastructure (EKS cluster, VPC, etc.)
Running the full script ensures resources are removed cleanly so you avoid any lingering costs.
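After cleanup completes, you can spot-check that no billable resources remain. This is a sketch; the region is a placeholder for your deployment:
# EKS clusters remaining in the region
aws eks list-clusters --region <your-region>
# ECR repositories that may still hold Dynamo images
aws ecr describe-repositories --region <your-region> --query 'repositories[].repositoryName' --output table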
This deployment provides a production-ready NVIDIA Dynamo environment on Amazon EKS with enterprise-grade features including Karpenter automatic scaling, EFA networking, and seamless AWS service integration.