Inference on EKS

AI on EKS provides comprehensive solutions for deploying AI/ML inference workloads on Amazon EKS, supporting both GPU and AWS Neuron (Inferentia/Trainium) hardware configurations.

Quick Start Options

Get started quickly with our pre-configured Helm charts that support multiple models and deployment patterns:

  • Inference Charts - Streamlined Helm-based deployments with pre-configured values for popular models (an install sketch follows this list)
      • Supports both GPU and Neuron hardware
      • Includes vLLM and Ray-vLLM frameworks
      • Pre-configured for 10+ popular models, including Llama, DeepSeek, and Mistral
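As a quick illustration, a chart deployment typically reduces to a single helm install against a values file for the chosen model. This is a minimal sketch: the chart path, release name, and values file name are assumptions for illustration; check the ai-on-eks repository for the actual chart location and supported values.

```bash
# Clone the repository and install the inference chart with a
# model-specific values file (paths and file names are illustrative).
git clone https://github.com/awslabs/ai-on-eks.git
cd ai-on-eks

helm install llama-inference ./blueprints/inference/inference-charts \
  --namespace inference --create-namespace \
  -f ./blueprints/inference/inference-charts/values-llama-gpu.yaml

# Watch the pods come up
kubectl get pods -n inference -w
```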

Hardware-Specific Guides

GPU Deployments

Explore GPU-specific inference solutions in the GPU deployment guides.
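For orientation, a minimal GPU-backed vLLM deployment looks like the sketch below. The image tag, model name, and resource values are assumptions for illustration, not values taken from the guides.

```yaml
# Minimal single-node vLLM deployment on a GPU node (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-gpu
  template:
    metadata:
      labels:
        app: vllm-gpu
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # assumed image; pin a version in practice
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]  # example model
          ports:
            - containerPort: 8000          # vLLM's OpenAI-compatible API port
          resources:
            limits:
              nvidia.com/gpu: 1            # scheduled onto a GPU node via the NVIDIA device plugin
```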

Neuron Deployments (AWS Inferentia)

Leverage AWS Inferentia chips for cost-effective inference.
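The Neuron path differs mainly in the device resource the pod requests: instead of nvidia.com/gpu, it asks the Neuron device plugin for aws.amazon.com/neuron devices. A sketch of the relevant pod spec fragment, with an assumed (hypothetical) image name and illustrative node selection:

```yaml
# Pod spec fragment for an Inferentia-backed pod (illustrative values).
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: inf2.xlarge  # example Inferentia instance type
  containers:
    - name: vllm-neuron
      image: public.ecr.aws/example/vllm-neuron:latest  # hypothetical image name
      resources:
        limits:
          aws.amazon.com/neuron: 1   # Neuron devices exposed by the Neuron device plugin
```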

Architecture Overview

AI on EKS inference solutions support multiple deployment patterns:

  • Single-node inference with vLLM
  • Distributed inference with Ray-vLLM
  • Production-ready deployments with load balancing
  • Auto-scaling capabilities (see the sketch after this list)
  • Observability and monitoring integration
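Of these patterns, auto-scaling is the easiest to sketch in isolation: a standard HorizontalPodAutoscaler can scale the vllm-gpu Deployment from the GPU example above. This minimal sketch scales on CPU for simplicity; production setups typically scale on request or accelerator metrics instead, which requires a metrics adapter.

```yaml
# Minimal HPA targeting the vllm-gpu Deployment (illustrative; CPU-based for simplicity).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gpu
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```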

Choosing the Right Approach

| Use Case | Recommended Solution | Benefits |
| --- | --- | --- |
| Quick prototyping | Inference Charts | Pre-configured values, fast deployment |
| GPU workloads | GPU-specific guides | Configurations tuned for GPU-based inference |
| Neuron workloads | Neuron guides | Cost-effective Inferentia-based inference |

Next Steps

  1. Start with Inference Charts for the fastest path to deployment
  2. Explore hardware-specific guides for optimized configurations
  3. Set up monitoring and observability for production workloads (a scraping sketch follows)
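For step 3, one common pattern is to scrape vLLM's built-in Prometheus metrics, which it serves at /metrics on the API port, using a prometheus-operator ServiceMonitor. This sketch assumes the prometheus-operator CRDs are installed and that a Service named vllm-gpu exposes port 8000 under a port named http; those names come from the earlier examples, not from this guide.

```yaml
# ServiceMonitor scraping vLLM's Prometheus metrics (assumed Service/label names).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-gpu
  namespace: inference
spec:
  selector:
    matchLabels:
      app: vllm-gpu
  endpoints:
    - port: http        # named port on the Service fronting the vLLM pods
      path: /metrics    # vLLM exposes Prometheus metrics here
```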