Deploying ML models on EKS requires access to GPU or Neuron instances. If a deployment isn't working, the cause is often missing access to these resources. Some deployment patterns also rely on Karpenter autoscaling or static node groups; if nodes aren't initializing, check the Karpenter or node group logs to resolve the issue.
NVIDIA Dynamo is a cloud-native platform for deploying and managing AI inference graphs at scale. This implementation provides complete infrastructure setup with enterprise-grade monitoring and scalability on Amazon EKS.
NVIDIA Dynamo on Amazon EKS
This NVIDIA Dynamo blueprint is under active development. Features, configurations, and deployment processes may change between releases as we refine the implementation based on user feedback and best practices.
Expect iterative improvements in upcoming releases. If you encounter issues or have suggestions, please open an issue or contribute to the project.
Quick Start
Want to get started immediately? Here's the minimal command sequence:
# 1. Clone and navigate
git clone https://github.com/awslabs/ai-on-eks.git && cd ai-on-eks/infra/nvidia-dynamo
# 2. Deploy infrastructure and platform (15-30 minutes)
./install.sh
# 3. Deploy inference examples using prebuilt NGC containers
cd ../../blueprints/inference/nvidia-dynamo
./deploy.sh # Interactive menu to choose example
# ./deploy.sh vllm # Deploy vLLM with interactive setup
# 4. Test your deployment (wait for model download)
kubectl port-forward svc/vllm-frontend 8000:8000 -n dynamo-cloud
curl http://localhost:8000/health
Prerequisites: AWS CLI, kubectl, helm, terraform, git, NGC API token, HuggingFace token (detailed setup below)
What is NVIDIA Dynamo?
NVIDIA Dynamo is an open-source inference framework designed to optimize performance and scalability for large language models (LLMs) and generative AI applications. Released under the Apache 2.0 license, Dynamo provides a datacenter-scale distributed inference serving framework that orchestrates complex AI workloads across multiple GPUs and nodes.
What is an Inference Graph?
An inference graph is a computational workflow that defines how AI models process data through interconnected nodes, enabling complex multi-step AI operations like:
- LLM chains: Sequential processing through multiple language models
- Multimodal processing: Combining text, image, and audio processing
- Custom inference pipelines: Tailored workflows for specific AI applications
- Disaggregated serving: Separating prefill and decode phases for optimal resource utilization
Overview
This blueprint uses the official NVIDIA Dynamo Helm charts from the NVIDIA NGC catalog, with additional shell scripts and Terraform automation to simplify the deployment process on Amazon EKS.
Deployment Approach
Why This Setup Process? While this implementation involves multiple steps, it provides several advantages over a simple Helm-only deployment:
- Complete Infrastructure: Automatically provisions VPC, EKS cluster, ECR repositories, and monitoring stack
- Production Ready: Includes enterprise-grade security, monitoring, and scalability features
- AWS Integration: Leverages EKS autoscaling, EFA networking, and AWS services
- Customizable: Allows fine-tuning of GPU node pools, networking, and resource allocation
- Reproducible: Infrastructure as Code ensures consistent deployments across environments
For Simpler Deployments: If you already have an EKS cluster and prefer a minimal setup, you can use the Dynamo Helm charts directly from the source repository. This blueprint provides the full production-ready experience.
As LLMs and generative AI applications become increasingly prevalent, the demand for efficient, scalable, and low-latency inference solutions has grown. Traditional inference systems often struggle to meet these demands, especially in distributed, multi-node environments. NVIDIA Dynamo addresses these challenges by offering innovative solutions to optimize performance and scalability with support for AWS services such as Amazon S3, Elastic Fabric Adapter (EFA), and Amazon EKS.
Key Features
Performance Optimizations:
- Disaggregated Serving: Separates prefill and decode phases across different GPUs for optimal resource utilization
- Dynamic GPU Scheduling: Intelligent resource allocation based on real-time demand through the NVIDIA Dynamo Planner
- Smart Request Routing: Minimizes KV cache recomputation by routing requests to workers with relevant cached data
- Accelerated Data Transfer: Low-latency communication via NVIDIA NIXL library
- Efficient KV Cache Management: Intelligent offloading across memory hierarchies with the KV Cache Block Manager
Infrastructure Ready:
- Inference Engine Agnostic: Supports TensorRT-LLM, vLLM, SGLang, and other runtimes
- Modular Design: Pick and choose components that fit your existing AI stack
- Enterprise Grade: Complete monitoring, logging, and security integration
- Amazon EKS Optimized: Leverages EKS autoscaling, GPU support, and AWS services
Architecture
The deployment uses Amazon EKS with the following components:

Key Components:
- VPC and Networking: Standard VPC with EFA support for low-latency inter-node communication
- EKS Cluster: Managed Kubernetes with GPU-enabled node groups using Karpenter
- Dynamo Platform: Operator, API Store, and supporting services (NATS, PostgreSQL, MinIO)
- Monitoring Stack: Prometheus, Grafana, and AI/ML observability
- Storage: Amazon EFS for shared model storage and caching
Prerequisites
System Requirements: Ubuntu 22.04 or 24.04 (NVIDIA Dynamo officially supports only these versions)
Install the following tools on your setup host (recommended: EC2 instance t3.xlarge or higher with EKS and ECR permissions):
- AWS CLI: Configured with appropriate permissions (installation guide)
- kubectl: Kubernetes command-line tool (installation guide)
- helm: Kubernetes package manager (installation guide)
- terraform: Infrastructure as code tool (installation guide)
- git: Version control (installation guide)
- Python 3.10+: With pip and venv (installation guide)
- EKS Cluster: Version 1.33 (tested and supported)
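Before running the installer, it can help to confirm the tools above are actually available on the setup host. A minimal sketch (adjust the tool list to your environment):

```bash
# Verify the required CLIs are installed and on PATH before running install.sh
for tool in aws kubectl helm terraform git python3; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done

aws sts get-caller-identity   # confirms AWS credentials are configured
python3 --version             # should report 3.10 or newer
```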
Required API Tokens
- NGC API Token: Required for accessing NVIDIA's prebuilt Dynamo container images
- Sign up at NVIDIA NGC
- Generate an API key from your account settings
- Set as the `NGC_API_KEY` environment variable or provide it during installation
- HuggingFace Token: Required for downloading models
- Create account at HuggingFace
- Generate access token with model read permissions
- Set as the `HF_TOKEN` environment variable or provide it interactively during deployment
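Both tokens can be exported ahead of time so the install and deploy scripts pick them up without prompting (placeholder values shown):

```bash
# Placeholder values; substitute your own keys before running install.sh / deploy.sh
export NGC_API_KEY="<your-ngc-api-key>"
export HF_TOKEN="<your-huggingface-token>"
```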
Deploying the Solution
Available Examples
Production-Ready Examples
The following examples are fully tested and production-ready with comprehensive documentation:
| Example | Runtime | Model | Architecture | Node Type | Key Features |
|---|---|---|---|---|---|
| hello-world | CPU | N/A | Aggregated | CPU | Basic connectivity testing |
| vllm | vLLM | Qwen3-0.6B | Aggregated | G5 GPU | OpenAI API, balanced performance |
| sglang | SGLang | DeepSeek-R1-Distill-8B | Aggregated | G5 GPU | RadixAttention caching |
| trtllm | TensorRT-LLM | DeepSeek-R1-Distill-8B | Aggregated | G5 GPU | Maximum inference performance |
| multi-replica-vllm | vLLM | Multiple models | Multi-replica HA | G5 GPU | KV routing, load balancing |
Advanced Examples (Beta)
These examples demonstrate advanced Dynamo features and are suitable for experimental workloads:
| Example | Runtime | Architecture | Use Case | Key Features |
|---|---|---|---|---|
| vllm-disagg | vLLM | Disaggregated | High throughput | Separate prefill/decode workers |
| sglang-disagg | SGLang | Disaggregated | Memory optimization | RadixAttention + disaggregation |
| trtllm-disagg | TensorRT-LLM | Disaggregated | Ultra-high performance | TRT-LLM + disaggregation |
| kv-routing | Multi-runtime | Intelligent routing | Cache optimization | KV-aware request routing |
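Assuming `deploy.sh` accepts an example name the same way as `./deploy.sh vllm` in the Quick Start, an advanced example can be selected directly; otherwise use the interactive menu:

```bash
cd blueprints/inference/nvidia-dynamo
./deploy.sh vllm-disagg   # assumes the example name matches the table above
```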
Example Highlights
🚀 hello-world: Perfect starting point
- CPU-only deployment for testing Dynamo platform functionality
- Fast deployment (~2 minutes)
- No GPU or model dependencies
- Ideal for CI/CD validation
⚡ vllm: Recommended for most use cases
- OpenAI-compatible API (`/v1/chat/completions`, `/v1/models`)
- Small model (Qwen3-0.6B) for quick testing
- Production-ready health checks
- G5 GPU optimization
🧠 sglang: Advanced caching capabilities
- RadixAttention for 2-10x speedup on repetitive queries
- Structured generation support (JSON/XML)
- Advanced memory management
- Perfect for cache-heavy workloads
🏎️ trtllm: Maximum performance
- NVIDIA TensorRT-LLM optimized kernels
- Highest throughput and lowest latency
- Custom CUDA kernels
- Best for production serving
🌐 multi-replica-vllm: High availability deployments
- Multiple independent worker replicas with KV routing
- Automatic load balancing and failover
- Intelligent cache-aware request routing
- Ideal for production workloads requiring high availability
All 9 examples have been thoroughly tested and validated on EKS clusters with GPU nodes. Each example includes proper health checks, OpenAI-compatible API endpoints, and production-ready configurations. See our testing summary for detailed validation results.
Test and Validate
Automated Testing
Use the built-in test script to validate your deployment:
./test.sh
This script:
- Starts port forwarding to the frontend service
- Tests the health check, metrics, and `/v1/models` endpoints
- Runs sample inference requests to verify functionality
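The same checks can be run by hand once port forwarding is active; the endpoint paths below follow the list above, and the frontend is assumed to be reachable on localhost:8000:

```bash
# Manual equivalents of the test.sh checks
curl -s http://localhost:8000/health
curl -s http://localhost:8000/metrics | head
curl -s http://localhost:8000/v1/models
```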
Manual Testing
Access your deployment directly:
kubectl port-forward svc/<frontend-service> 8000:8000 -n dynamo-cloud &
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {"role": "user", "content": "Explain what a Q-Bit is in quantum computing."}
    ],
    "max_tokens": 2000,
    "temperature": 0.7,
    "stream": false
  }'
Expected Output:
{
  "id": "1918b11a-6d98-4891-bc84-08f99de70fd0",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "A Q-bit, or qubit, is the basic unit of quantum information...",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1752018267,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "object": "chat.completion"
}
Monitor and Observe
Grafana Dashboard
Access Grafana for visualization (default port 3000):
kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80
Prometheus Metrics
Access Prometheus for metrics collection (port 9090):
kubectl port-forward -n kube-prometheus-stack svc/prometheus 9090:80
Automatic Monitoring
The deployment automatically creates:
- Service: Exposes inference graphs for API calls and metrics
- ServiceMonitor: Configures Prometheus to scrape metrics
- Dashboards: Pre-configured Grafana dashboards for inference monitoring
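To confirm these objects exist after a deployment, a quick check (resource names vary by example, and the namespace follows the examples in this guide) might look like:

```bash
# Verify the Service and ServiceMonitor created for the inference graph
kubectl get svc -n dynamo-cloud
kubectl get servicemonitor -n dynamo-cloud
```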
Advanced Configuration
Version Management
The deployment automatically manages Dynamo versions with flexible override options:
Default Behavior:
- Reads the version from `terraform/blueprint.tfvars` (`dynamo_stack_version = "v0.4.1"`)
- Automatically updates container image tags in YAML manifests
- Creates temporary manifests without modifying source files
Override Options:
# Environment variable (highest priority)
export DYNAMO_VERSION=v0.4.1
./deploy.sh vllm
# Inline override
DYNAMO_VERSION=v0.4.1 ./deploy.sh sglang
# Update terraform/blueprint.tfvars (persistent)
dynamo_stack_version = "v0.4.1"
Supported Versions:
- v0.4.1: Current stable release (default)
- Custom versions from private builds
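To confirm which image tag a deployment actually picked up, inspecting the running pods is a simple check (a sketch; the namespace follows the examples above):

```bash
# Print pod name and container image(s) to confirm the Dynamo version in use
kubectl get pods -n dynamo-cloud \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```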
Custom Model Deployment
To deploy custom models, modify the configuration files in dynamo/examples/llm/configs/:
- Choose Architecture: Select based on model size and requirements
- Update Configuration: Edit the appropriate YAML file
- Set Model Parameters: Update the `model` and `served_model_name` fields
- Configure Resources: Adjust GPU allocation and memory settings
Example for DeepSeek-R1 70B model:
Common:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  max-model-len: 32768
  tensor-parallel-size: 4
Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
VllmWorker:
  ServiceArgs:
    resources:
      gpu: '4'
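After editing the configuration, redeploy the example so the new model settings take effect. This sketch assumes `deploy.sh` re-applies the updated manifests, as it does for the bundled examples:

```bash
cd blueprints/inference/nvidia-dynamo
./deploy.sh vllm   # redeploy so the edited configuration takes effect
```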
Configuration Options
The main configuration is in terraform/blueprint.tfvars:
# Required for Dynamo deployment
enable_dynamo_stack = true
enable_argocd = true
# Dynamo platform version
dynamo_stack_version = "v0.4.1"
# Required infrastructure components
enable_aws_efs_csi_driver = true
enable_aws_efa_k8s_device_plugin = true
enable_ai_ml_observability_stack = true
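Changes to `blueprint.tfvars` are applied by re-running the installer, or by applying Terraform directly (the directory layout below is assumed from this guide):

```bash
# Re-run the installer to apply configuration changes
cd infra/nvidia-dynamo
./install.sh

# Or apply Terraform directly from the terraform/ subdirectory
# cd terraform && terraform apply -var-file=blueprint.tfvars
```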
Troubleshooting
Common Issues
- GPU Nodes Not Available: Check Karpenter logs and instance availability
- Pod Failures: Check resource limits and cluster capacity
- Model Download Failures: Verify HuggingFace token and network connectivity
- API 503 Errors: Wait for model loading or check worker health
Debug Commands
# Check cluster status
kubectl get nodes
kubectl get pods -n dynamo-cloud
# View logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-server
kubectl logs -n dynamo-cloud -l app=vllm-worker
# Check deployments
kubectl get dynamographdeployment -n dynamo-cloud
kubectl describe dynamographdeployment <name> -n dynamo-cloud
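When GPU nodes never appear, the Karpenter side is usually the first place to look. The namespace and label selectors below are common defaults and may differ in your install:

```bash
# Inspect Karpenter activity and pending capacity requests
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100
kubectl get nodepools,nodeclaims

# Pending pods usually explain themselves in the Events section
kubectl describe pod <pending-pod-name> -n dynamo-cloud
```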
Node Selection and Customization
Selecting Instance Types
You can customize which Karpenter node pool your Dynamo components deploy to by modifying the nodeSelector in your DynamoGraphDeployment:
# Example: Deploy GPU worker to G5 instances
VllmWorker:
  extraPodSpec:
    nodeSelector:
      karpenter.sh/nodepool: g5-gpu-karpenter
  resources:
    requests:
      gpu: "1"

# Example: Deploy frontend to CPU instances
Frontend:
  extraPodSpec:
    nodeSelector:
      karpenter.sh/nodepool: cpu-karpenter
Available Node Pools (configured in base infrastructure):
- `g5-gpu-karpenter`: G5 instances with NVIDIA A10G GPUs
- `g6-gpu-karpenter`: G6 instances with NVIDIA L4 GPUs (if configured)
- `cpu-karpenter`: CPU-only instances for frontends
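To see which node pools exist in your cluster and which labels nodes carry (so the `nodeSelector` values above match), something like this works, assuming Karpenter v1 CRDs:

```bash
# List Karpenter node pools and the labels applied to current nodes
kubectl get nodepools
kubectl get nodes -L karpenter.sh/nodepool,node.kubernetes.io/instance-type
```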
Custom Development
For advanced customization and development:
- Source Code: Full Dynamo source code is available at ~/dynamo with comprehensive documentation and examples
- Blueprint Examples: Each example in the `blueprints/inference/nvidia-dynamo/` folder includes a detailed README file
- Container Source: All source code is included in NGC containers at `/workspace/` for in-container customization
Refer to the individual README files in each blueprint example for specific customization guidance.
Multi-Node Tensor Parallelism Limitations
Understanding Multi-Replica vs Multi-Node
It's important to distinguish between multi-replica deployments (what our examples provide) and true multi-node tensor parallelism (which requires specialized infrastructure):
What Our Examples Provide (Multi-Replica)
- Multiple Independent Workers: Each worker replica runs the complete model independently (TP=1)
- High Availability: Service continues operating if individual workers fail
- Load Balancing: Requests distributed across workers for increased throughput
- KV-Aware Routing: Intelligent request routing based on cache overlap to maximize performance
- Kubernetes Native: Works seamlessly with standard Kubernetes deployments
What Our Examples Do NOT Provide (True Multi-Node TP)
- Cross-Node Model Sharding: Models are not split across multiple nodes
- Memory Scaling for Large Models: Each worker must fit the complete model (no cross-node memory sharing)
- Tensor Parallelism Across Nodes: No cross-node tensor operations
Current Kubernetes Limitations
Kubernetes does not currently support true multi-node tensor parallelism for distributed inference workloads due to several technical constraints:
Infrastructure Requirements
True multi-node tensor parallelism requires:
- MPI/Slurm Environment: Uses `mpirun` or `srun` for coordinated distributed model loading
- Synchronized Initialization: All participating nodes must start simultaneously and maintain coordination
- Low-Latency Interconnects: Requires InfiniBand, NVLink, or similar high-performance networking
- Shared Process Groups: Distributed training/inference frameworks need process group management not available in K8s
Why Kubernetes Doesn't Support This (Currently)
- Pod Isolation: Kubernetes pods are designed to be isolated units, making cross-pod tensor operations challenging
- Dynamic Scheduling: K8s dynamic pod placement conflicts with the static, coordinated startup required for multi-node TP
- Network Abstraction: K8s networking abstractions don't expose the low-level network primitives needed for efficient tensor communication
- Missing MPI Integration: No native MPI job management in Kubernetes (though projects like MPI-Operator exist, they're not widely adopted for inference)
Current Support in Dynamo Backends
Based on the official Dynamo documentation and examples, here's what each backend supports:
SGLang Multi-Node Support ✅
- Status: Fully supported for multi-node tensor parallelism
- Requirements: Slurm environment with MPI coordination
- Configuration: Uses the `--nnodes`, `--node-rank`, and `--dist-init-addr` parameters
- Example: DeepSeek-R1 across 4 nodes with TP16 (16 GPUs total)
- Kubernetes: Not supported - requires Slurm/MPI environment
# SGLang multi-node example (Slurm only)
python3 -m dynamo.sglang.worker \
--model-path /model/ \
--tp 16 \
--nnodes 2 \
--node-rank 0 \
--dist-init-addr ${HEAD_NODE_IP}:29500
TensorRT-LLM Multi-Node Support ✅
- Status: Fully supported with WideEP (Wide Expert Parallelism)
- Requirements: Slurm environment with an MPI launcher (`srun` or `mpirun`)
- Configuration: Multi-node TP16/EP16 configurations available
- Example: DeepSeek-R1 across 4x GB200 nodes
- Kubernetes: Not supported - requires MPI coordination
# TRT-LLM multi-node example (Slurm only)
srun --nodes=4 --ntasks-per-node=4 \
python3 -m dynamo.trtllm \
--model-path /model/ \
--engine-config wide_ep_config.yaml
vLLM Multi-Node Support ❌
- Status: Currently not supported for true multi-node tensor parallelism
- Current Capability: Single-node tensor parallelism only (multiple GPUs on same node)
- Our Implementation: Multi-replica for high availability (each replica runs full model)
- Future: May be added in future vLLM releases
Workarounds for Large Models
If you need to run models that don't fit on a single node, consider these alternatives:
1. High-Memory Single-Node Instances
Use AWS instances with large GPU memory:
# Example: P5.48xlarge with 8x H100 (80GB each = 640GB total)
extraPodSpec:
  nodeSelector:
    karpenter.sh/nodepool: p5-gpu-karpenter
    node.kubernetes.io/instance-type: p5.48xlarge
resources:
  requests:
    gpu: "8"
2. Model Optimization Techniques
- Quantization: Use FP16, FP8, or INT8 quantized models
- Model Pruning: Remove less important parameters
- LoRA/QLoRA: Use parameter-efficient fine-tuned models
3. Slurm-Based Deployments
For models requiring true multi-node TP, deploy outside Kubernetes:
# Use official Dynamo examples with Slurm
cd ~/dynamo/docs/components/backends/trtllm/
./srun_disaggregated.sh # 8-node disaggregated deployment
4. Disaggregated Architecture
Use our disaggregated examples for better resource utilization:
- Prefill Workers: Handle input processing (can be smaller instances)
- Decode Workers: Handle token generation (optimized for throughput)
- Independent Scaling: Scale each component based on workload
Future Development
Multi-Node Tensor Parallelism in Kubernetes may become available in future versions through:
- Enhanced MPI Integration: Projects like Kubeflow's MPI-Operator for inference workloads
- Native K8s Support: Kubernetes SIG-Scheduling working on gang scheduling and coordinated pod startup
- Vendor Solutions: Cloud providers may develop custom solutions for managed inference
- Framework Evolution: Inference frameworks adding Kubernetes-native distributed execution
Recommendations
For Current Deployments:
- Small to Medium Models (≤70B): Use single-node deployments with multi-GPU instances
- High Availability Needs: Use our multi-replica examples with KV routing
- Large Models (70B+): Consider Slurm-based deployments outside Kubernetes
- Maximum Performance: Use disaggregated architecture with optimized worker ratios
Monitoring Future Developments:
- Follow Dynamo releases for Kubernetes multi-node TP updates
- Check TensorRT-LLM and vLLM roadmaps
- Monitor Kubernetes SIG-Scheduling for gang scheduling improvements
Alternative Deployment Options
For Existing EKS Clusters
If you already have an EKS cluster with GPU nodes and prefer a simpler approach:
- Direct Helm Installation: Use the official NVIDIA Dynamo Helm charts directly from the dynamo source repository
- Manual Setup: Follow the upstream NVIDIA Dynamo documentation for Kubernetes deployment
- Custom Integration: Integrate Dynamo components into your existing infrastructure
Why Use This Blueprint?
This blueprint is designed for users who want:
- Complete Infrastructure: End-to-end setup from VPC to running inference
- Production Readiness: Enterprise-grade monitoring, security, and scalability
- AWS Integration: Optimized for EKS, ECR, EFA, and other AWS services
- Best Practices: Follows ai-on-eks patterns and AWS recommendations
References
Official NVIDIA Resources
📚 Documentation:
- NVIDIA Dynamo Official Docs: Complete platform documentation
- NVIDIA Developer Blog: Introduction and architecture overview
- NVIDIA Dynamo Product Page: Official product information
🐙 Source Code:
- NVIDIA Dynamo GitHub: Main repository with source code
- NVIDIA NIXL Library: NVIDIA Inference Xfer Library for low-latency communication
📦 Container Images & Helm Charts:
- Dynamo Collection (NGC): Complete collection of Dynamo resources
- Dynamo Platform Helm Chart: Official Kubernetes deployment
- vLLM Runtime Container: vLLM backend (v0.4.1)
- SGLang Runtime Container: SGLang backend (v0.4.1)
- TensorRT-LLM Runtime Container: TRT-LLM backend (v0.4.1)
AI-on-EKS Blueprint Resources
🏗️ Infrastructure & Examples:
- AI-on-EKS Repository: Main blueprint repository
- Dynamo Blueprint: Complete blueprint with examples
- Infrastructure Code: Terraform and deployment scripts
📖 Example Documentation:
- Hello World: CPU-only testing example
- vLLM Example: vLLM aggregated serving
- SGLang Example: RadixAttention caching
- TensorRT-LLM Example: Optimized inference
- Multi-Replica vLLM: High availability deployments
Related Technologies
🚀 Inference Frameworks:
- vLLM: High-throughput LLM inference engine
- SGLang: Structured generation with RadixAttention
- TensorRT-LLM: NVIDIA's optimized inference library
☸️ Kubernetes & AWS:
- Amazon EKS: Managed Kubernetes service
- Karpenter: Kubernetes node autoscaling
- ArgoCD: GitOps continuous delivery
Next Steps
- Explore Examples: Check the examples folder in the GitHub repository
- Scale Deployments: Configure multi-node setups for larger models
- Integrate Applications: Connect your applications to the inference endpoints
- Monitor Performance: Use Grafana dashboards for ongoing monitoring
- Optimize Costs: Implement auto-scaling and resource optimization
Clean Up
When you're finished with your NVIDIA Dynamo deployment, remove all resources using the consolidated cleanup script:
cd infra/nvidia-dynamo
./cleanup.sh
What gets cleaned up (in proper order):
- Dynamo Examples: All deployed inference graphs and workloads
- Dynamo Platform: Operator, API Store, and supporting services
- ArgoCD Applications: GitOps-managed resources
- Kubernetes Resources: Namespaces, secrets, and configurations
- Infrastructure: EKS cluster, VPC, security groups, and all AWS resources
- Cost Optimization: Ensures no lingering resources continue billing
Features:
- Intelligent Ordering: Cleans up dependencies in correct sequence
- Safety Checks: Confirms resource existence before deletion attempts
- Progress Feedback: Shows cleanup progress and any issues encountered
- Complete Removal: No manual cleanup steps required
Duration: ~10-15 minutes for complete infrastructure teardown
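After the script finishes, a quick sanity check that nothing is still billing can be done with the AWS CLI (the filters below are illustrative; adjust names to your deployment):

```bash
# Confirm the EKS cluster and tagged VPC resources are gone
aws eks list-clusters --query 'clusters'
aws ec2 describe-vpcs --filters "Name=tag:Name,Values=*dynamo*" --query 'Vpcs[].VpcId'
```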
This deployment provides a production-ready NVIDIA Dynamo environment on Amazon EKS with enterprise-grade features including Karpenter automatic scaling, EFA networking, and seamless AWS service integration.