
AI on EKS Inference Charts

The AI on EKS Inference Charts provide a streamlined, Helm-based way to deploy AI/ML inference workloads on both GPU and AWS Neuron (Inferentia/Trainium) hardware. The chart supports multiple deployment configurations and ships with pre-configured values files for popular models.

Overview

The inference charts support the following deployment types:

  • GPU-based VLLM deployments - Single-node VLLM inference
  • GPU-based Ray-VLLM deployments - Distributed VLLM inference with Ray
  • Neuron-based VLLM deployments - VLLM inference on AWS Inferentia chips
  • Neuron-based Ray-VLLM deployments - Distributed VLLM inference with Ray on Inferentia

Prerequisites

Before deploying the inference charts, ensure you have:

  • Amazon EKS cluster with GPU or AWS Neuron nodes (the JARK-stack blueprint provides a quick start)
  • Helm 3.0+
  • For GPU deployments: NVIDIA device plugin installed
  • For Neuron deployments: AWS Neuron device plugin installed
  • Hugging Face Hub token (stored as a Kubernetes secret)
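
A quick way to sanity-check these prerequisites before installing (the DaemonSet names below are common defaults and may differ in your cluster):

# Helm 3.x is required
helm version --short

# The device plugins normally run as DaemonSets; adjust the names if your cluster differs
kubectl get daemonset -A | grep -Ei 'nvidia-device-plugin|neuron-device-plugin'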

Quick Start

1. Create Hugging Face Token Secret

Create a Kubernetes secret with your Hugging Face token:

kubectl create secret generic hf-token --from-literal=token=your_huggingface_token
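
To confirm the secret exists and carries the token under the expected key:

# The secret created above stores the token under the key "token"
kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 --decode | cut -c1-6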

2. Deploy a Pre-configured Model

Choose from the available pre-configured models and deploy:

warning

These deployments require GPU or Neuron capacity, which must be provisioned in your cluster and costs more than CPU-only instances.

# Deploy Llama 3.2 1B on GPU with VLLM
helm install llama-inference ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-32-1b-vllm.yaml

# Deploy DeepSeek R1 Distill on GPU with Ray-VLLM
helm install deepseek-inference ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-deepseek-r1-distill-llama-8b-ray-vllm-gpu.yaml
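
After either install, you can check the release and watch the pods come up; the exact pod names depend on the values file used:

# List Helm releases and wait for the inference pods to become Ready
helm list
kubectl get pods -w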

Supported Models

The inference charts include pre-configured values files for the following models:

GPU Models

| Model | Size | Framework | Values File |
|-------|------|-----------|-------------|
| DeepSeek R1 Distill Llama | 8B | Ray-VLLM | values-deepseek-r1-distill-llama-8b-ray-vllm-gpu.yaml |
| Llama 3.2 | 1B | VLLM | values-llama-32-1b-vllm.yaml |
| Llama 3.2 | 1B | Ray-VLLM | values-llama-32-1b-ray-vllm.yaml |
| Llama 4 Scout | 17B | VLLM | values-llama-4-scout-17b-vllm.yaml |
| Mistral Small | 24B | Ray-VLLM | values-mistral-small-24b-ray-vllm.yaml |

Neuron Models (AWS Inferentia/Trainium)

| Model | Size | Framework | Values File |
|-------|------|-----------|-------------|
| DeepSeek R1 Distill Llama | 8B | VLLM | values-deepseek-r1-distill-llama-8b-vllm-neuron.yaml |
| Llama 2 | 13B | Ray-VLLM | values-llama-2-13b-ray-vllm-neuron.yaml |
| Llama 3 | 70B | Ray-VLLM | values-llama-3-70b-ray-vllm-neuron.yaml |
| Llama 3.1 | 8B | VLLM | values-llama-31-8b-vllm-neuron.yaml |
| Llama 3.1 | 8B | Ray-VLLM | values-llama-31-8b-ray-vllm-neuron.yaml |

Deployment Examples

GPU Deployments

Deploy Llama 3.2 1B with VLLM

helm install llama32-vllm ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-32-1b-vllm.yaml

Deploy DeepSeek R1 Distill with Ray-VLLM

helm install deepseek-ray ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-deepseek-r1-distill-llama-8b-ray-vllm-gpu.yaml

Deploy Mistral Small 24B with Ray-VLLM

helm install mistral-ray ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-mistral-small-24b-ray-vllm.yaml

Neuron Deployments

Deploy Llama 3.1 8B with VLLM on Inferentia

helm install llama31-neuron ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-31-8b-vllm-neuron.yaml

Deploy Llama 3 70B with Ray-VLLM on Inferentia

helm install llama3-70b-neuron ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-3-70b-ray-vllm-neuron.yaml
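
To confirm the pods are actually consuming Neuron devices (assuming the standard aws.amazon.com/neuron resource name exposed by the Neuron device plugin):

# Neuron capacity, allocatable, and allocated counts appear in the node description
kubectl describe nodes | grep -i 'aws.amazon.com/neuron'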

Configuration Options

Key Parameters

The chart provides extensive configuration options. Here are the most important parameters:

| Parameter | Description | Default |
|-----------|-------------|---------|
| inference.accelerator | Accelerator type (gpu or neuron) | gpu |
| inference.framework | Framework type (vllm or rayVllm) | vllm |
| inference.serviceName | Name of the inference service | inference |
| inference.modelServer.deployment.replicas | Number of replicas | 1 |
| modelParameters.modelId | Model ID from Hugging Face Hub | NousResearch/Llama-3.2-1B |
| modelParameters.gpuMemoryUtilization | GPU memory utilization | 0.8 |
| modelParameters.maxModelLen | Maximum model sequence length | 8192 |
| modelParameters.tensorParallelSize | Tensor parallel size | 1 |
| service.type | Service type | ClusterIP |
| service.port | Service port | 8000 |
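
Any of these parameters can also be overridden at install time with --set on top of a pre-configured values file; the release name and override values below are illustrative:

# Start from a pre-configured values file and override individual parameters
helm install my-inference ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-32-1b-vllm.yaml \
--set inference.modelServer.deployment.replicas=2 \
--set modelParameters.maxModelLen=4096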

Custom Deployment

Create your own values file for custom configurations:

inference:
  accelerator: gpu # or neuron
  framework: vllm # or rayVllm
  serviceName: custom-inference
  modelServer:
    deployment:
      replicas: 2
      resources:
        gpu:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1

modelParameters:
  modelId: "your-custom-model-id"
  gpuMemoryUtilization: "0.9"
  maxModelLen: "4096"
  tensorParallelSize: "1"

Deploy with custom values:

helm install custom-inference ./blueprints/inference/inference-charts \
--values custom-values.yaml
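
If you later edit custom-values.yaml, roll the change out to the existing release with helm upgrade:

helm upgrade custom-inference ./blueprints/inference/inference-charts \
--values custom-values.yaml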

API Endpoints

Once deployed, the service exposes OpenAI-compatible API endpoints:

  • /v1/models - List available models
  • /v1/completions - Text completion API
  • /v1/chat/completions - Chat completion API
  • /metrics - Prometheus metrics endpoint

Example API Usage

Note: These deployments do not create an Ingress; use kubectl port-forward to test the service from your machine, e.g. for DeepSeek:

kubectl get svc | grep deepseek
# Note the service name for DeepSeek, in this case deepseekr1-dis-llama-8b-ray-vllm-gpu-ray-vllm
kubectl port-forward svc/deepseekr1-dis-llama-8b-ray-vllm-gpu-ray-vllm 8000
# List models
curl http://localhost:8000/v1/models

# Chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
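
The plain text completion endpoint follows the same pattern; the model name and prompt below are placeholders:

# Text completion
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "your-model-name",
"prompt": "The capital of France is",
"max_tokens": 50
}'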

Monitoring and Observability

The charts include built-in observability features:

  • Fluent Bit for log collection
  • Prometheus metrics for monitoring
  • Grafana dashboards for visualizations

Access metrics at the /metrics endpoint of your deployed service.
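
With the same port-forward in place, you can spot-check the scraped metrics; vLLM-specific series are typically prefixed with vllm:

# Fetch the metrics endpoint and look at a few vLLM series
curl -s http://localhost:8000/metrics | grep -i vllm | head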

Troubleshooting

Common Issues

  1. Pod stuck in Pending state

    • Check if GPU/Neuron nodes are available
    • Verify resource requests match available hardware
  2. Model download failures

    • Ensure Hugging Face token is correctly configured
    • Check network connectivity to Hugging Face Hub
  3. Out of memory errors

    • Adjust gpuMemoryUtilization parameter
    • Consider using tensor parallelism for larger models
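
A few commands that help narrow these issues down (pod names below are placeholders):

# Scheduling events explain Pending pods (e.g. insufficient nvidia.com/gpu or aws.amazon.com/neuron)
kubectl describe pod <inference-pod-name>

# Compare requested vs. available accelerator capacity on the nodes
kubectl describe nodes | grep -A 8 'Allocated resources'

# Model download problems usually surface in the server logs
kubectl logs <inference-pod-name> | grep -iE 'error|huggingface'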

Logs

Check deployment logs:

kubectl logs -l app=inference-server

For Ray deployments, check Ray cluster status:

kubectl exec -it <ray-head-pod> -- ray status
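
If the Ray head service exposes the dashboard port (8265 by default), it can also be forwarded for a cluster-level view; the service name below is a placeholder:

kubectl port-forward svc/<ray-head-service> 8265:8265
# Then open http://localhost:8265 in a browser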

Next Steps