# Inference on EKS
AI on EKS provides comprehensive solutions for deploying AI/ML inference workloads on Amazon EKS, supporting both GPU and AWS Neuron (Inferentia/Trainium) hardware configurations.
## Quick Start Options
### 🚀 Inference Charts (Recommended)
Get started quickly with our pre-configured Helm charts, which support multiple models and deployment patterns (see the install sketch after this list):
- Inference Charts - Streamlined Helm-based deployments with pre-configured values for popular models
- Supports both GPU and Neuron hardware
- Includes vLLM and Ray-vLLM frameworks
- Pre-configured for 10+ popular models, including Llama, DeepSeek, and Mistral
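To give a feel for the workflow, here is a minimal install sketch. The repository URL, chart name, and values keys below are placeholders, not the charts' documented interface; substitute the ones from the Inference Charts page.

```bash
# Hypothetical repo URL and chart name -- replace both with the values
# from the Inference Charts documentation.
helm repo add ai-on-eks https://example.com/ai-on-eks/charts
helm repo update

# Install one of the pre-configured model deployments. The framework,
# accelerator, and model keys are illustrative placeholders.
helm install llama-inference ai-on-eks/inference-charts \
  --namespace inference --create-namespace \
  --set inference.framework=vllm \
  --set inference.accelerator=gpu \
  --set inference.model=meta-llama/Meta-Llama-3-8B-Instruct
```

The same chart-plus-values pattern covers the Ray-vLLM and Neuron variants; only the values change.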
## Hardware-Specific Guides
### GPU Deployments
Explore GPU-specific inference solutions (a minimal vLLM deployment sketch follows the list):
- DeepSeek-R1 with Ray and vLLM
- NVIDIA NIM with Llama3
- NVIDIA NIM Operator
- vLLM with NVIDIA Triton Server
- vLLM with Ray Serve
- Stable Diffusion on GPUs
- AIBrix with DeepSeek
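As a concrete starting point, the sketch below deploys a single-GPU vLLM server with a plain Kubernetes Deployment. It assumes the NVIDIA device plugin is installed (which exposes the `nvidia.com/gpu` resource); the namespace, labels, and model are illustrative, and the guides above cover production details this sketch omits.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpu
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-gpu
  template:
    metadata:
      labels:
        app: vllm-gpu
    spec:
      containers:
        - name: vllm
          # Upstream vLLM image with the OpenAI-compatible API server.
          image: vllm/vllm-openai:latest
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.3"]  # illustrative model
          ports:
            - name: http
              containerPort: 8000   # default vLLM API port
          resources:
            limits:
              nvidia.com/gpu: 1     # one GPU per replica
EOF
```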
### Neuron Deployments (AWS Inferentia)
Leverage AWS Inferentia chips for cost-effective inference (a pod-spec sketch follows the list):
- Llama2 on Inferentia2
- Llama3 on Inferentia2
- Mistral 7B on Inferentia2
- Ray Serve High Availability
- vLLM with Ray on Inferentia2
- Stable Diffusion on Inferentia2
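The Neuron-specific parts of a pod spec are small. The sketch below assumes the Neuron device plugin is running on your Inferentia nodes (it exposes the `aws.amazon.com/neuron` resource); the image is a placeholder for the serving images used in the guides above.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: neuron-inference
  namespace: inference
spec:
  nodeSelector:
    # Well-known node label; pin to an Inferentia2 instance type.
    node.kubernetes.io/instance-type: inf2.xlarge
  containers:
    - name: serve
      image: public.ecr.aws/example/neuron-serve:latest  # placeholder image
      resources:
        limits:
          aws.amazon.com/neuron: 1   # one Inferentia2 device
EOF
```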
## Architecture Overview
AI on EKS inference solutions support multiple deployment patterns (an autoscaling sketch follows the list):
- Single-node inference with vLLM
- Distributed inference with Ray-vLLM
- Production-ready deployments with load balancing
- Auto-scaling capabilities
- Observability and monitoring integration
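As one example of the auto-scaling pattern, a standard `autoscaling/v2` HorizontalPodAutoscaler can scale the `vllm-gpu` Deployment from the GPU sketch above. Production setups often scale on custom metrics such as request queue depth instead; the CPU target and replica bounds here are only illustrative.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gpu
  minReplicas: 1
  maxReplicas: 4        # illustrative bounds
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
EOF
```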
## Choosing the Right Approach
| Use Case | Recommended Solution | Benefits |
|---|---|---|
| Quick prototyping | Inference Charts | Pre-configured, fast deployment |
| GPU inference | GPU-specific guides | Optimized configurations for NVIDIA GPUs |
| Inferentia inference | Neuron guides | Cost-effective inference on AWS Inferentia |
## Next Steps
- Start with the Inference Charts for the fastest path to deployment
- Explore the hardware-specific guides for optimized configurations
- Set up monitoring and observability for production workloads (a Prometheus wiring sketch follows)
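For the monitoring step, one common pattern is to scrape vLLM's Prometheus metrics (served on its `/metrics` endpoint) with the Prometheus Operator. The sketch assumes kube-prometheus-stack is installed and that a Service labeled `app: vllm-gpu` exposes the API port under the name `http`.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-gpu
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm-gpu      # must match the Service's labels
  endpoints:
    - port: http         # named Service port for the vLLM API
      path: /metrics     # vLLM's Prometheus endpoint
      interval: 30s
EOF
```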