warning

Deploying ML models on EKS requires access to GPU or Neuron instances. If your deployment isn't working, it is often because the account lacks access to these instance types. Some deployment patterns also rely on Karpenter autoscaling or static node groups; if nodes aren't initializing, check the Karpenter or node group logs to resolve the issue.

info

NVIDIA Dynamo is a cloud-native platform for deploying and managing AI inference graphs at scale. This implementation provides complete infrastructure setup with enterprise-grade monitoring and scalability on Amazon EKS.

NVIDIA Dynamo on Amazon EKS

Active Development

This NVIDIA Dynamo blueprint is currently in active development. We are continuously improving the user experience and functionality. Features, configurations, and deployment processes may change between releases as we iterate and enhance the implementation based on user feedback and best practices.

Please expect iterative improvements in upcoming releases. If you encounter any issues or have suggestions for improvements, please feel free to open an issue or contribute to the project.

Quick Start

Want to get started immediately? Here's the minimal command sequence:

# 1. Clone and navigate
git clone https://github.com/awslabs/ai-on-eks.git && cd ai-on-eks/infra/nvidia-dynamo

# 2. Deploy infrastructure and platform (15-30 minutes)
./install.sh

# 3. Deploy inference examples using prebuilt NGC containers
cd ../../blueprints/inference/nvidia-dynamo

./deploy.sh # Interactive menu to choose example
# ./deploy.sh vllm # Deploy vLLM with interactive setup

# 4. Test your deployment (wait for model download)
kubectl port-forward svc/vllm-frontend 8000:8000 -n dynamo-cloud
curl http://localhost:8000/health

Prerequisites: AWS CLI, kubectl, helm, terraform, git, NGC API token, HuggingFace token (detailed setup below)


What is NVIDIA Dynamo?

NVIDIA Dynamo is an open-source inference framework designed to optimize performance and scalability for large language models (LLMs) and generative AI applications. Released under the Apache 2.0 license, Dynamo provides a datacenter-scale distributed inference serving framework that orchestrates complex AI workloads across multiple GPUs and nodes.

What is an Inference Graph?

An inference graph is a computational workflow that defines how AI models process data through interconnected nodes, enabling complex multi-step AI operations like:

  • LLM chains: Sequential processing through multiple language models
  • Multimodal processing: Combining text, image, and audio processing
  • Custom inference pipelines: Tailored workflows for specific AI applications
  • Disaggregated serving: Separating prefill and decode phases for optimal resource utilization

Overview

This blueprint uses the official NVIDIA Dynamo Helm charts from the NVIDIA NGC catalog, with additional shell scripts and Terraform automation to simplify the deployment process on Amazon EKS.

Deployment Approach

Why This Setup Process? While this implementation involves multiple steps, it provides several advantages over a simple Helm-only deployment:

  • Complete Infrastructure: Automatically provisions VPC, EKS cluster, ECR repositories, and monitoring stack
  • Production Ready: Includes enterprise-grade security, monitoring, and scalability features
  • AWS Integration: Leverages EKS autoscaling, EFA networking, and AWS services
  • Customizable: Allows fine-tuning of GPU node pools, networking, and resource allocation
  • Reproducible: Infrastructure as Code ensures consistent deployments across environments

For Simpler Deployments: If you already have an EKS cluster and prefer a minimal setup, you can use the Dynamo Helm charts directly from the source repository. This blueprint provides the full production-ready experience.

As LLMs and generative AI applications become increasingly prevalent, the demand for efficient, scalable, and low-latency inference solutions has grown. Traditional inference systems often struggle to meet these demands, especially in distributed, multi-node environments. NVIDIA Dynamo addresses these challenges by offering innovative solutions to optimize performance and scalability with support for AWS services such as Amazon S3, Elastic Fabric Adapter (EFA), and Amazon EKS.

Key Features

Performance Optimizations:

  • Disaggregated Serving: Separates prefill and decode phases across different GPUs for optimal resource utilization
  • Dynamic GPU Scheduling: Intelligent resource allocation based on real-time demand through the NVIDIA Dynamo Planner
  • Smart Request Routing: Minimizes KV cache recomputation by routing requests to workers with relevant cached data
  • Accelerated Data Transfer: Low-latency communication via NVIDIA NIXL library
  • Efficient KV Cache Management: Intelligent offloading across memory hierarchies with the KV Cache Block Manager

Infrastructure Ready:

  • Inference Engine Agnostic: Supports TensorRT-LLM, vLLM, SGLang, and other runtimes
  • Modular Design: Pick and choose components that fit your existing AI stack
  • Enterprise Grade: Complete monitoring, logging, and security integration
  • Amazon EKS Optimized: Leverages EKS autoscaling, GPU support, and AWS services

Architecture

The deployment uses Amazon EKS with the following components:

NVIDIA Dynamo Architecture

Key Components:

  • VPC and Networking: Standard VPC with EFA support for low-latency inter-node communication
  • EKS Cluster: Managed Kubernetes with GPU-enabled node groups using Karpenter
  • Dynamo Platform: Operator, API Store, and supporting services (NATS, PostgreSQL, MinIO)
  • Monitoring Stack: Prometheus, Grafana, and AI/ML observability
  • Storage: Amazon EFS for shared model storage and caching

Prerequisites

System Requirements: Ubuntu 22.04 or 24.04 (NVIDIA Dynamo officially supports only these versions)

Install the following tools on your setup host (recommended: an EC2 instance, t3.xlarge or larger, with EKS and ECR permissions): AWS CLI, kubectl, helm, terraform, and git.

Required API Tokens

  • NGC API Token: Required for accessing NVIDIA's prebuilt Dynamo container images
    • Sign up at NVIDIA NGC
    • Generate an API key from your account settings
    • Set as NGC_API_KEY environment variable or provide during installation
  • HuggingFace Token: Required for downloading models
    • Create account at HuggingFace
    • Generate access token with model read permissions
    • Set as HF_TOKEN environment variable or provide interactively during deployment
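
Both tokens can be exported before running the install and deploy scripts so you are not prompted interactively. A minimal sketch (the placeholder values are yours to fill in):

# Export credentials for the installer and deploy scripts
export NGC_API_KEY=<your-ngc-api-key>
export HF_TOKEN=<your-huggingface-token>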

Deploying the Solution


Available Examples

Production-Ready Examples

The following examples are fully tested and production-ready with comprehensive documentation:

Example | Runtime | Model | Architecture | Node Type | Key Features
hello-world | CPU | N/A | Aggregated | CPU | Basic connectivity testing
vllm | vLLM | Qwen3-0.6B | Aggregated | G5 GPU | OpenAI API, balanced performance
sglang | SGLang | DeepSeek-R1-Distill-8B | Aggregated | G5 GPU | RadixAttention caching
trtllm | TensorRT-LLM | DeepSeek-R1-Distill-8B | Aggregated | G5 GPU | Maximum inference performance
multi-replica-vllm | vLLM | Multiple models | Multi-replica HA | G5 GPU | KV routing, load balancing

Advanced Examples (Beta)

These examples demonstrate advanced Dynamo features and are suitable for experimental workloads:

Example | Runtime | Architecture | Use Case | Key Features
vllm-disagg | vLLM | Disaggregated | High throughput | Separate prefill/decode workers
sglang-disagg | SGLang | Disaggregated | Memory optimization | RadixAttention + disaggregation
trtllm-disagg | TensorRT-LLM | Disaggregated | Ultra-high performance | TRT-LLM + disaggregation
kv-routing | Multi-runtime | Intelligent routing | Cache optimization | KV-aware request routing

Example Highlights

🚀 hello-world: Perfect starting point

  • CPU-only deployment for testing Dynamo platform functionality
  • Fast deployment (~2 minutes)
  • No GPU or model dependencies
  • Ideal for CI/CD validation
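
To try it, the example can be deployed by name from the blueprint directory, assuming deploy.sh accepts the example name the same way it does for vllm in the Quick Start:

# Deploy the CPU-only hello-world example (no GPUs or model tokens needed)
cd ai-on-eks/blueprints/inference/nvidia-dynamo
./deploy.sh hello-world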

vllm: Recommended for most use cases

  • OpenAI-compatible API (/v1/chat/completions, /v1/models)
  • Small model (Qwen3-0.6B) for quick testing
  • Production-ready health checks
  • G5 GPU optimization

🧠 sglang: Advanced caching capabilities

  • RadixAttention for 2-10x speedup on repetitive queries
  • Structured generation support (JSON/XML)
  • Advanced memory management
  • Perfect for cache-heavy workloads

🏎️ trtllm: Maximum performance

  • NVIDIA TensorRT-LLM optimized kernels
  • Highest throughput and lowest latency
  • Custom CUDA kernels
  • Best for production serving

🌐 multi-replica-vllm: High availability deployments

  • Multiple independent worker replicas with KV routing
  • Automatic load balancing and failover
  • Intelligent cache-aware request routing
  • Ideal for production workloads requiring high availability
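
To see the replicas and which nodes they were scheduled on, the worker label from the Debug Commands section below can be reused (assuming the workers carry the app=vllm-worker label):

# List worker replicas and the nodes they landed on
kubectl get pods -n dynamo-cloud -l app=vllm-worker -o wide
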
Comprehensive Testing

All 9 examples have been thoroughly tested and validated on EKS clusters with GPU nodes. Each example includes proper health checks, OpenAI-compatible API endpoints, and production-ready configurations. See our testing summary for detailed validation results.

Test and Validate

Automated Testing

Use the built-in test script to validate your deployment:

./test.sh

This script:

  • Starts port forwarding to the frontend service
  • Tests health check, metrics, and /v1/models endpoints
  • Runs sample inference requests to verify functionality
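
The same checks can be run by hand while a port-forward to the frontend is active. A sketch, assuming the frontend exposes all three endpoints on port 8000 as the test script does (the /metrics path is an assumption):

kubectl port-forward svc/vllm-frontend 8000:8000 -n dynamo-cloud &
curl http://localhost:8000/health      # liveness check
curl http://localhost:8000/v1/models   # list served models
curl http://localhost:8000/metrics     # Prometheus metrics (path assumed)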

Manual Testing

Access your deployment directly:

kubectl port-forward svc/<frontend-service> 8000:8000 -n dynamo-cloud &

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
      {"role": "user", "content": "Explain what a Q-Bit is in quantum computing."}
    ],
    "max_tokens": 2000,
    "temperature": 0.7,
    "stream": false
  }'

Expected Output:

{
  "id": "1918b11a-6d98-4891-bc84-08f99de70fd0",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "A Q-bit, or qubit, is the basic unit of quantum information...",
        "role": "assistant"
      },
      "finish_reason": "stop"
    }
  ],
  "created": 1752018267,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "object": "chat.completion"
}

Monitor and Observe

Grafana Dashboard

Access Grafana for visualization (default port 3000):

kubectl port-forward -n kube-prometheus-stack svc/kube-prometheus-stack-grafana 3000:80
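
The Grafana admin password is stored in a Kubernetes secret. Assuming the chart's default secret name (<release>-grafana with an admin-password key), it can be retrieved with:

# Read the Grafana admin password from the chart-managed secret
kubectl get secret -n kube-prometheus-stack kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo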

Prometheus Metrics

Access Prometheus for metrics collection (port 9090):

kubectl port-forward -n kube-prometheus-stack svc/prometheus 9090:80
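
With the port-forward running in another terminal, the standard Prometheus HTTP API can be used to confirm that targets are being scraped, for example by querying the built-in up metric:

# Quick sanity check against the Prometheus query API
curl -s 'http://localhost:9090/api/v1/query?query=up'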

Automatic Monitoring

The deployment automatically creates:

  • Service: Exposes inference graphs for API calls and metrics
  • ServiceMonitor: Configures Prometheus to scrape metrics
  • Dashboards: Pre-configured Grafana dashboards for inference monitoring
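
To confirm these objects exist after a deployment, assuming they are created in the dynamo-cloud namespace:

kubectl get svc -n dynamo-cloud
kubectl get servicemonitor -n dynamo-cloud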

Advanced Configuration

Version Management

The deployment automatically manages Dynamo versions with flexible override options:

Default Behavior:

  • Reads version from terraform/blueprint.tfvars (dynamo_stack_version = "v0.4.1")
  • Automatically updates container image tags in YAML manifests
  • Creates temporary manifests without modifying source files

Override Options:

# Environment variable (highest priority)
export DYNAMO_VERSION=v0.4.1
./deploy.sh vllm

# Inline override
DYNAMO_VERSION=v0.4.1 ./deploy.sh sglang

# Update terraform/blueprint.tfvars (persistent)
dynamo_stack_version = "v0.4.1"

Supported Versions:

  • v0.4.1: Current stable release (default)
  • Custom versions from private builds

Custom Model Deployment

To deploy custom models, modify the configuration files in dynamo/examples/llm/configs/:

  1. Choose Architecture: Select based on model size and requirements
  2. Update Configuration: Edit the appropriate YAML file
  3. Set Model Parameters: Update model and served_model_name fields
  4. Configure Resources: Adjust GPU allocation and memory settings

Example for DeepSeek-R1 70B model:

Common:
  model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
  max-model-len: 32768
  tensor-parallel-size: 4

Frontend:
  served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-70B

VllmWorker:
  ServiceArgs:
    resources:
      gpu: '4'

Configuration Options

The main configuration is in terraform/blueprint.tfvars:

# Required for Dynamo deployment
enable_dynamo_stack = true
enable_argocd = true

# Dynamo platform version
dynamo_stack_version = "v0.4.1"

# Required infrastructure components
enable_aws_efs_csi_driver = true
enable_aws_efa_k8s_device_plugin = true
enable_ai_ml_observability_stack = true
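
After editing blueprint.tfvars, the simplest path is to re-run the install script, which is assumed to drive the underlying Terraform and pick up the change (a hedged sketch, not a verified workflow):

cd ai-on-eks/infra/nvidia-dynamo
./install.sh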

Troubleshooting

Common Issues

  1. GPU Nodes Not Available: Check Karpenter logs and instance availability
  2. Pod Failures: Check resource limits and cluster capacity
  3. Model Download Failures: Verify HuggingFace token and network connectivity
  4. API 503 Errors: Wait for model loading or check worker health

Debug Commands

# Check cluster status
kubectl get nodes
kubectl get pods -n dynamo-cloud

# View logs
kubectl logs -n argocd -l app.kubernetes.io/name=argocd-server
kubectl logs -n dynamo-cloud -l app=vllm-worker

# Check deployments
kubectl get dynamographdeployment -n dynamo-cloud
kubectl describe dynamographdeployment <name> -n dynamo-cloud
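
When GPU nodes fail to appear, Karpenter's logs and custom resources are usually the fastest way to find the cause. A sketch, assuming Karpenter runs in the karpenter namespace (adjust to kube-system if your cluster installs it there):

# Karpenter controller logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# Provisioning state: node pools and the node claims they produced
kubectl get nodepools
kubectl get nodeclaims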

Node Selection and Customization

Selecting Instance Types

You can customize which Karpenter node pool your Dynamo components deploy to by modifying the nodeSelector in your DynamoGraphDeployment:

# Example: Deploy GPU worker to G5 instances
VllmWorker:
  extraPodSpec:
    nodeSelector:
      karpenter.sh/nodepool: g5-gpu-karpenter
  resources:
    requests:
      gpu: "1"

# Example: Deploy frontend to CPU instances
Frontend:
  extraPodSpec:
    nodeSelector:
      karpenter.sh/nodepool: cpu-karpenter

Available Node Pools (configured in base infrastructure):

  • g5-gpu-karpenter: G5 instances with NVIDIA A10G GPUs
  • g6-gpu-karpenter: G6 instances with NVIDIA L4 GPUs (if configured)
  • cpu-karpenter: CPU-only instances for frontends
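
To check which node pools actually exist in your cluster and how a given pool is configured (the names above are the defaults from the base infrastructure and may differ):

kubectl get nodepools
kubectl describe nodepool g5-gpu-karpenter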

Custom Development

For advanced customization and development:

  1. Source Code: Full Dynamo source code is available at ~/dynamo with comprehensive documentation and examples
  2. Blueprint Examples: Each example in the blueprints/inference/nvidia-dynamo/ folder includes detailed README files
  3. Container Source: All source code is included in NGC containers at /workspace/ for in-container customization

Refer to the individual README files in each blueprint example for specific customization guidance.

Multi-Node Tensor Parallelism Limitations

Understanding Multi-Replica vs Multi-Node

It's important to distinguish between multi-replica deployments (what our examples provide) and true multi-node tensor parallelism (which requires specialized infrastructure):

What Our Examples Provide (Multi-Replica)

  • Multiple Independent Workers: Each worker replica runs the complete model independently (TP=1)
  • High Availability: Service continues operating if individual workers fail
  • Load Balancing: Requests distributed across workers for increased throughput
  • KV-Aware Routing: Intelligent request routing based on cache overlap to maximize performance
  • Kubernetes Native: Works seamlessly with standard Kubernetes deployments

What Our Examples Do NOT Provide (True Multi-Node TP)

  • Cross-Node Model Sharding: Models are not split across multiple nodes
  • Memory Scaling for Large Models: Each worker must fit the complete model (no cross-node memory sharing)
  • Tensor Parallelism Across Nodes: No cross-node tensor operations

Current Kubernetes Limitations

Kubernetes does not currently support true multi-node tensor parallelism for distributed inference workloads due to several technical constraints:

Infrastructure Requirements

True multi-node tensor parallelism requires:

  • MPI/Slurm Environment: Uses mpirun or srun for coordinated distributed model loading
  • Synchronized Initialization: All participating nodes must start simultaneously and maintain coordination
  • Low-Latency Interconnects: Requires InfiniBand, NVLink, or similar high-performance networking
  • Shared Process Groups: Distributed training/inference frameworks need process group management not available in K8s

Why Kubernetes Doesn't Support This (Currently)

  1. Pod Isolation: Kubernetes pods are designed to be isolated units, making cross-pod tensor operations challenging
  2. Dynamic Scheduling: K8s dynamic pod placement conflicts with the static, coordinated startup required for multi-node TP
  3. Network Abstraction: K8s networking abstractions don't expose the low-level network primitives needed for efficient tensor communication
  4. Missing MPI Integration: No native MPI job management in Kubernetes (though projects like MPI-Operator exist, they're not widely adopted for inference)

Current Support in Dynamo Backends

Based on the official Dynamo documentation and examples, here's what each backend supports:

SGLang Multi-Node Support ✅

  • Status: Fully supported for multi-node tensor parallelism
  • Requirements: Slurm environment with MPI coordination
  • Configuration: Uses --nnodes, --node-rank, and --dist-init-addr parameters
  • Example: DeepSeek-R1 across 4 nodes with TP16 (16 GPUs total)
  • Kubernetes: Not supported - requires Slurm/MPI environment

# SGLang multi-node example (Slurm only)
python3 -m dynamo.sglang.worker \
  --model-path /model/ \
  --tp 16 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr ${HEAD_NODE_IP}:29500

TensorRT-LLM Multi-Node Support ✅

  • Status: Fully supported with WideEP (Wide Expert Parallelism)
  • Requirements: Slurm environment with MPI launcher (srun or mpirun)
  • Configuration: Multi-node TP16/EP16 configurations available
  • Example: DeepSeek-R1 across 4x GB200 nodes
  • Kubernetes: Not supported - requires MPI coordination

# TRT-LLM multi-node example (Slurm only)
srun --nodes=4 --ntasks-per-node=4 \
  python3 -m dynamo.trtllm \
    --model-path /model/ \
    --engine-config wide_ep_config.yaml

vLLM Multi-Node Support ❌

  • Status: Currently not supported for true multi-node tensor parallelism
  • Current Capability: Single-node tensor parallelism only (multiple GPUs on same node)
  • Our Implementation: Multi-replica for high availability (each replica runs full model)
  • Future: May be added in future vLLM releases

Workarounds for Large Models

If you need to run models that don't fit on a single node, consider these alternatives:

1. High-Memory Single-Node Instances

Use AWS instances with large GPU memory:

# Example: P5.48xlarge with 8x H100 (80GB each = 640GB total)
extraPodSpec:
  nodeSelector:
    karpenter.sh/nodepool: p5-gpu-karpenter
    node.kubernetes.io/instance-type: p5.48xlarge
resources:
  requests:
    gpu: "8"

2. Model Optimization Techniques

  • Quantization: Use FP16, FP8, or INT8 quantized models
  • Model Pruning: Remove less important parameters
  • LoRA/QLoRA: Use parameter-efficient fine-tuned models

3. Slurm-Based Deployments

For models requiring true multi-node TP, deploy outside Kubernetes:

# Use official Dynamo examples with Slurm
cd ~/dynamo/docs/components/backends/trtllm/
./srun_disaggregated.sh # 8-node disaggregated deployment

4. Disaggregated Architecture

Use our disaggregated examples for better resource utilization:

  • Prefill Workers: Handle input processing (can be smaller instances)
  • Decode Workers: Handle token generation (optimized for throughput)
  • Independent Scaling: Scale each component based on workload
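
The disaggregated variants are deployed the same way as the aggregated examples, by passing the example name from the tables above to the deploy script (an assumption based on its interface shown earlier):

cd ai-on-eks/blueprints/inference/nvidia-dynamo
./deploy.sh vllm-disagg   # separate prefill and decode workers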

Future Development

Multi-Node Tensor Parallelism in Kubernetes may become available in future versions through:

  1. Enhanced MPI Integration: Projects like Kubeflow's MPI-Operator for inference workloads
  2. Native K8s Support: Kubernetes SIG-Scheduling working on gang scheduling and coordinated pod startup
  3. Vendor Solutions: Cloud providers may develop custom solutions for managed inference
  4. Framework Evolution: Inference frameworks adding Kubernetes-native distributed execution

Recommendations

For Current Deployments:

  1. Small to Medium Models (≤70B): Use single-node deployments with multi-GPU instances
  2. High Availability Needs: Use our multi-replica examples with KV routing
  3. Large Models (70B+): Consider Slurm-based deployments outside Kubernetes
  4. Maximum Performance: Use disaggregated architecture with optimized worker ratios

Monitoring Future Developments:

Alternative Deployment Options

For Existing EKS Clusters

If you already have an EKS cluster with GPU nodes and prefer a simpler approach:

  1. Direct Helm Installation: Use the official NVIDIA Dynamo Helm charts directly from the dynamo source repository
  2. Manual Setup: Follow the upstream NVIDIA Dynamo documentation for Kubernetes deployment
  3. Custom Integration: Integrate Dynamo components into your existing infrastructure

Why Use This Blueprint?

This blueprint is designed for users who want:

  • Complete Infrastructure: End-to-end setup from VPC to running inference
  • Production Readiness: Enterprise-grade monitoring, security, and scalability
  • AWS Integration: Optimized for EKS, ECR, EFA, and other AWS services
  • Best Practices: Follows ai-on-eks patterns and AWS recommendations

References

Official NVIDIA Resources

📚 Documentation:

🐙 Source Code:

📦 Container Images & Helm Charts:

AI-on-EKS Blueprint Resources

🏗️ Infrastructure & Examples:

📖 Example Documentation:

🚀 Inference Frameworks:

  • vLLM: High-throughput LLM inference engine
  • SGLang: Structured generation with RadixAttention
  • TensorRT-LLM: NVIDIA's optimized inference library

☸️ Kubernetes & AWS:

Next Steps

  1. Explore Examples: Check the examples folder in the GitHub repository
  2. Scale Deployments: Configure multi-node setups for larger models
  3. Integrate Applications: Connect your applications to the inference endpoints
  4. Monitor Performance: Use Grafana dashboards for ongoing monitoring
  5. Optimize Costs: Implement auto-scaling and resource optimization

Clean Up

When you're finished with your NVIDIA Dynamo deployment, remove all resources using the consolidated cleanup script:

cd infra/nvidia-dynamo
./cleanup.sh

What gets cleaned up (in proper order):

  • Dynamo Examples: All deployed inference graphs and workloads
  • Dynamo Platform: Operator, API Store, and supporting services
  • ArgoCD Applications: GitOps-managed resources
  • Kubernetes Resources: Namespaces, secrets, and configurations
  • Infrastructure: EKS cluster, VPC, security groups, and all AWS resources
  • Cost Optimization: Ensures no lingering resources continue billing

Features:

  • Intelligent Ordering: Cleans up dependencies in correct sequence
  • Safety Checks: Confirms resource existence before deletion attempts
  • Progress Feedback: Shows cleanup progress and any issues encountered
  • Complete Removal: No manual cleanup steps required

Duration: ~10-15 minutes for complete infrastructure teardown
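
Once the script finishes, it is worth confirming the cluster is really gone before assuming billing has stopped (the region placeholder is yours to fill in):

# The cluster should no longer appear in the list
aws eks list-clusters --region <your-region>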

This deployment provides a production-ready NVIDIA Dynamo environment on Amazon EKS with enterprise-grade features including Karpenter automatic scaling, EFA networking, and seamless AWS service integration.