
Inference on EKS

Deploy and run Large Language Models (LLMs) and other AI models on Amazon EKS.

What's in This Section

This section provides practical deployment guides and Helm charts for running inference workloads on EKS. Whether you're deploying open-source LLMs, diffusion models, or custom AI models, you'll find ready-to-use configurations and step-by-step instructions.


Inference Charts

Helm charts for deploying popular AI models on EKS with pre-configured values for optimal performance.

What You Get:

  • Ready-to-deploy Helm charts for vLLM, Ray-vLLM, Triton, and Diffusers
  • Pre-configured values files for popular models (Llama, DeepSeek, Mistral, Stable Diffusion, and more)
  • Support for both GPU (NVIDIA) and Neuron (AWS Inferentia/Trainium) deployments
  • Configurations with health checks, autoscaling, and monitoring
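To illustrate how such a chart is typically customized, here is a minimal sketch of a values override. The key names (`inference`, `framework`, `model`, `accelerator`) are illustrative assumptions, not the charts' actual schema — consult each chart's bundled `values.yaml` for the real structure:

```yaml
# Hypothetical values override for a vLLM chart deployment.
# Key names are illustrative; check the chart's values.yaml for the real schema.
inference:
  framework: vllm
  model: meta-llama/Llama-3.1-8B-Instruct   # example Hugging Face model ID
  accelerator: gpu                          # or "neuron" for Inferentia/Trainium
resources:
  limits:
    nvidia.com/gpu: 1
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
```

Pre-configured values files like the ones listed above let you swap models by changing only the `-f` file passed to `helm install`, keeping the deployment pattern identical across teams.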

Use Cases:

  • Quick deployment of open-source LLMs
  • Standardized deployment patterns across your organization
  • Reference implementations for custom model deployments

Explore Inference Charts →


Framework-Specific Deployment Guides

Detailed, framework-specific guides for deploying models on EKS, organized by hardware type.

GPU Deployments

Step-by-step guides for deploying models on NVIDIA GPUs:
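On GPU node groups, pods request the extended resource exposed by the NVIDIA device plugin. A minimal sketch (the pod name and image tag are placeholders; node selectors and tolerations depend on how your node groups are labeled):

```yaml
# Minimal GPU pod spec fragment: nvidia.com/gpu is the extended resource
# exposed by the NVIDIA device plugin. Name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: llm-gpu-example
spec:
  containers:
    - name: server
      image: vllm/vllm-openai:latest   # illustrative image tag
      resources:
        limits:
          nvidia.com/gpu: 1            # GPUs are requested as whole devices
```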

Neuron Deployments

Step-by-step guides for deploying models on AWS Inferentia and Trainium:
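Neuron deployments follow the same pattern, but request the extended resource exposed by the Neuron device plugin instead. A minimal sketch (pod name, image, and instance type are placeholders):

```yaml
# Minimal Neuron pod spec fragment: aws.amazon.com/neuron is exposed by the
# Neuron device plugin on Inferentia/Trainium node groups.
apiVersion: v1
kind: Pod
metadata:
  name: llm-neuron-example
spec:
  containers:
    - name: server
      image: my-registry/neuron-inference:latest   # placeholder image
      resources:
        limits:
          aws.amazon.com/neuron: 1   # one Neuron device (each holds multiple NeuronCores)
  nodeSelector:
    node.kubernetes.io/instance-type: inf2.xlarge  # example Inferentia2 instance
```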


Getting Started

  1. Set up your infrastructure - Start with the Inference-Ready Cluster to provision an EKS cluster optimized for AI/ML workloads

  2. Choose your deployment method:

    • For quick deployments with popular models → Use Inference Charts
    • For specific frameworks or custom configurations → See Framework-Specific Guides above
  3. Optimize your deployment - Apply best practices from the Guidance section to improve performance and reduce costs
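The steps above can be sketched as a command sequence. The repository URL, chart name, and values file below are placeholders — substitute the actual locations from the Inference Charts page:

```shell
# Hypothetical quickstart -- repo URL, chart name, and values file are
# placeholders; use the real ones from the Inference Charts documentation.
helm repo add ai-on-eks https://example.com/ai-on-eks/charts
helm repo update

# Install with a model-specific values file bundled with the chart
helm install llama-demo ai-on-eks/inference-charts \
  --namespace inference --create-namespace \
  -f values-llama-gpu.yaml

# Verify that the model server pods come up
kubectl get pods -n inference
```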


Need Help?