
AI on EKS Inference Charts

The AI on EKS Inference Charts provide a streamlined, Helm-based approach to deploying AI/ML inference workloads on both GPU and AWS Neuron (Inferentia/Trainium) hardware. The charts support multiple deployment configurations and come with pre-configured values for popular models.

Advanced Usage

For detailed configuration options, advanced deployment scenarios, and comprehensive parameter documentation, see the complete README.

Overview

The inference charts support multiple deployment frameworks:

  • VLLM - Single-node inference with fast startup
  • Ray-VLLM - Distributed inference with autoscaling capabilities
  • Triton-VLLM - Production-ready inference server with advanced features
  • AIBrix - VLLM with AIBrix-specific configurations
  • LeaderWorkerSet-VLLM - Multi-node inference for large models
  • Diffusers - Hugging Face Diffusers for image generation
  • S3 Model Copy - Download models from Hugging Face to S3 storage

Both GPU and AWS Neuron (Inferentia/Trainium) accelerators are supported across these frameworks.
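
The accelerator and framework are selected through chart values. As a minimal sketch (assuming the chart path used throughout this page and the inference.accelerator and inference.framework keys listed under Configuration below), a combination can also be chosen with --set overrides instead of a bundled values file:

# Hypothetical override-based install; the bundled values files remain the recommended path
helm install my-inference ./blueprints/inference/inference-charts \
--set inference.accelerator=gpu \
--set inference.framework=ray-vllm \
--set model=NousResearch/Llama-3.2-1B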

Prerequisites

Before deploying the inference charts, ensure you have the following (a quick verification sketch follows the list):

  • Amazon EKS cluster with GPU or AWS Neuron nodes (inference-ready cluster for a quick start)
  • Helm 3.0+
  • For GPU deployments: NVIDIA device plugin installed
  • For Neuron deployments: AWS Neuron device plugin installed
  • For LeaderWorkerSet deployments: LeaderWorkerSet CRD installed
  • Hugging Face Hub token (stored as a Kubernetes secret named hf-token)
  • For Ray: KubeRay Infrastructure
  • For AIBrix: AIBrix Infrastructure
  • For S3 Model Copy: Service account with S3 write permissions
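
A minimal verification sketch for some of these prerequisites (the resource names are those registered by the NVIDIA and Neuron device plugins, and the CRD name is the one the LeaderWorkerSet project installs; adjust to your environment):

# Nodes should advertise nvidia.com/gpu or aws.amazon.com/neuron capacity
kubectl describe nodes | grep -E 'nvidia.com/gpu|aws.amazon.com/neuron'

# The LeaderWorkerSet CRD is only required for LWS deployments
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io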

Quick Start

1. Create Hugging Face Token Secret

Create a Kubernetes secret with your Hugging Face token:

kubectl create secret generic hf-token --from-literal=token=your_huggingface_token
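
To double-check the secret before deploying (optional):

# The secret must exist in the namespace you deploy the chart into
kubectl get secret hf-token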

2. Deploy a Pre-configured Model

Choose from the available pre-configured models and deploy:

warning

These deployments require GPU or Neuron capacity, which must be enabled in your cluster and costs more than CPU-only instances.

# Deploy Llama 3.2 1B on GPU with vLLM
helm install llama-inference ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-32-1b-vllm.yaml

# Deploy DeepSeek R1 Distill on GPU with Ray-vLLM
helm install deepseek-inference ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-deepseek-r1-distill-llama-8b-ray-vllm-gpu.yaml
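
Model weights are downloaded on first start, so pods can take several minutes to become ready. A quick way to watch progress (the release name is whatever you passed to helm install):

# Watch the inference pods start and pull the model
kubectl get pods -w

# Review the release and the objects it created
helm status llama-inference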

Supported Models

The inference charts include pre-configured values files for popular models across different categories:

Language Models

  • DeepSeek R1 Distill Llama 8B - Advanced reasoning model
  • Llama 3.2 1B - Lightweight language model
  • Llama 4 Scout 17B - Mid-size language model
  • Mistral Small 24B - Efficient large language model
  • GPT OSS 20B - Open-source GPT variant
  • Qwen3 1.7B - Compact multilingual language model

Diffusion Models

  • FLUX.1 Schnell - Fast text-to-image generation
  • Stable Diffusion XL - High-quality image generation
  • Stable Diffusion 3.5 - Latest SD model with enhanced capabilities
  • Kolors - Artistic image generation
  • OmniGen - Multi-modal generation

Neuron-Optimized Models

  • Llama 2 13B - Optimized for AWS Inferentia
  • Llama 3 70B - Large model on Inferentia
  • Llama 3.1 8B - Efficient Inferentia deployment

Each model comes with optimized configurations for different frameworks (VLLM, Ray-VLLM, Triton-VLLM, etc.).
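
All bundled configurations ship as values-*.yaml files next to the chart, so listing them (from the repository root, as in the examples below) shows every available model/framework combination:

ls ./blueprints/inference/inference-charts/values-*.yaml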

Deployment Examples

Language Model Deployments

# Deploy Llama 3.2 1B with VLLM
helm install llama32-vllm ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-32-1b-vllm.yaml

# Deploy DeepSeek R1 Distill with Ray-VLLM
helm install deepseek-ray ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-deepseek-r1-distill-llama-8b-ray-vllm-gpu.yaml

# Deploy Llama 4 Scout 17B with LeaderWorkerSet-VLLM
helm install llama4-lws ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-4-scout-17b-lws-vllm.yaml

Diffusion Model Deployments

# Deploy FLUX.1 Schnell for image generation
helm install flux-diffusers ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-flux-1-diffusers.yaml

# Deploy Stable Diffusion XL
helm install sdxl-diffusers ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-stable-diffusion-xl-base-1-diffusers.yaml

Neuron Deployments

# Deploy Llama 3.1 8B on Inferentia
helm install llama31-neuron ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-31-8b-vllm-neuron.yaml

# Deploy Llama 3 70B with Ray-VLLM on Inferentia
helm install llama3-70b-neuron ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-llama-3-70b-ray-vllm-neuron.yaml

S3 Model Copy

The S3 Model Copy feature allows you to download models from Hugging Face Hub and upload them to S3 storage. This is useful for:

  • Pre-staging models in S3 for faster deployment
  • Creating model repositories in private S3 buckets
  • Reducing inference startup time by leveraging the AWS internal network

# Copy Llama 3 8B model from Hugging Face to S3
helm install s3-copy-llama3 ./blueprints/inference/inference-charts \
--values ./blueprints/inference/inference-charts/values-s3-copy-llama3-8b.yaml
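
The copy runs as a Kubernetes Job. A sketch for monitoring it and confirming the upload (the job name and bucket/model path below are placeholders; use the values from your values file):

# Follow the copy job (its name depends on the release and values file)
kubectl get jobs
kubectl logs job/<s3-copy-job-name> -f

# Verify the files landed in the bucket (placeholder path)
aws s3 ls s3://<my-models-bucket>/<org>/<model>/ --recursive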

Custom S3 Model Copy

Create a custom values file for copying any model to S3:

s3ModelCopy:
  namespace: default
  model: deepseek-ai/DeepSeek-R1
  s3Path: my-models-bucket/ # Model will be copied as s3://my-models-bucket/deepseek-ai/DeepSeek-R1

serviceAccountName: s3-copy-service-account # Service account with S3 write permissions

Deploy the S3 copy job:

helm install custom-s3-copy ./blueprints/inference/inference-charts \
--values custom-s3-copy-values.yaml

S3 Permissions

The service account needs IAM permissions to write to your target S3 bucket. Consider using Pod Identity to grant the service account permission to S3.
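
A sketch of one way to do this with EKS Pod Identity (the cluster name, account ID, and role are placeholders; the role must allow writing to the target bucket, and the eks-pod-identity-agent add-on must be installed):

# Associate an IAM role with the service account used by the copy job
aws eks create-pod-identity-association \
--cluster-name <my-cluster> \
--namespace default \
--service-account s3-copy-service-account \
--role-arn arn:aws:iam::<account-id>:role/<s3-write-role>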

Configuration

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| inference.accelerator | Accelerator type (gpu or neuron) | gpu |
| inference.framework | Framework (vllm, ray-vllm, triton-vllm, aibrix, etc.) | vllm |
| inference.serviceName | Name of the inference service | inference |
| inference.modelServer.deployment.replicas | Number of replicas | 1 |
| model | Model ID from Hugging Face Hub | NousResearch/Llama-3.2-1B |
| modelParameters.gpuMemoryUtilization | GPU memory utilization | 0.8 |
| modelParameters.maxModelLen | Maximum model sequence length | 8192 |
| modelParameters.tensorParallelSize | Tensor parallel size | 1 |
| modelParameters.pipelineParallelSize | Pipeline parallel size | 1 |
| s3ModelCopy.namespace | Namespace for S3 model copy job | default |
| s3ModelCopy.model | Hugging Face model ID to copy to S3 | Not set |
| s3ModelCopy.s3Path | S3 path where model should be uploaded | Not set |
| serviceAccountName | Service account name | default |

Custom Configuration

Create a custom values file:

inference:
  accelerator: gpu # or neuron
  framework: vllm # vllm, ray-vllm, triton-vllm, aibrix, lws-vllm, diffusers
  serviceName: my-inference
  modelServer:
    deployment:
      replicas: 1
      instanceType: g5.2xlarge

model: "NousResearch/Llama-3.2-1B"
modelParameters:
  gpuMemoryUtilization: 0.8
  maxModelLen: 8192
  tensorParallelSize: 1

Deploy with custom values:

helm install my-inference ./blueprints/inference/inference-charts \
--values custom-values.yaml

API Usage

The deployed services expose different API endpoints based on the framework:

VLLM/Ray-VLLM

  • /v1/models - List available models
  • /v1/chat/completions - Chat completion API
  • /v1/completions - Text completion API
  • /metrics - Prometheus metrics

Triton-VLLM

  • /v2/models - List available models
  • /v2/models/vllm_model/generate - Model inference
  • /v2/health/ready - Health checks

Diffusers

  • /v1/generations - Image generation API

Example Usage

Access your service via port-forward:

kubectl port-forward svc/<service-name> 8000

Test the API:

# Chat completion (VLLM/Ray-VLLM)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "your-model-name",
  "messages": [{"role": "user", "content": "Hello!"}],
  "max_tokens": 100
}'

# Image generation (Diffusers)
curl -X POST http://localhost:8000/v1/generations \
-H 'Content-Type: application/json' \
-d '{"prompt": "A beautiful sunset over mountains"}'

Troubleshooting

Common Issues

  1. Pod stuck in Pending state

    • Check if GPU/Neuron nodes are available (see the diagnostic sketch after this list)
    • Verify resource requests match available hardware
    • For LeaderWorkerSet deployments: Ensure LeaderWorkerSet CRD is installed
  2. Model download failures

    • Ensure Hugging Face token is correctly configured as secret hf-token
    • Check network connectivity to Hugging Face Hub
    • Verify model ID is correct and accessible
  3. Out of memory errors

    • Adjust gpuMemoryUtilization parameter (try reducing from 0.8 to 0.7)
    • Consider using tensor parallelism for larger models
    • For large models, use LeaderWorkerSet or Ray deployments with multiple GPUs
  4. Ray deployment issues

    • Ensure KubeRay infrastructure is installed
    • Check Ray cluster status and worker connectivity
    • Verify Ray version compatibility
  5. Triton deployment issues

    • Check Triton server logs for model loading errors
    • Verify model repository configuration
    • Ensure proper health check endpoints are accessible
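
A minimal diagnostic sketch for the scheduling and capacity checks above (the pod name is a placeholder):

# Scheduling events explain why a pod is Pending
kubectl describe pod <pod-name>

# Recent events across the namespace, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp

# Confirm nodes advertise the requested accelerator
kubectl describe nodes | grep -E 'nvidia.com/gpu|aws.amazon.com/neuron'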

Logs

Check deployment logs based on framework:


# VLLM deployments
kubectl logs -l app.kubernetes.io/component=<service-name>

# Ray deployments
kubectl logs -l ray.io/node-type=head
kubectl logs -l ray.io/node-type=worker

# LeaderWorkerSet deployments
kubectl logs -l leaderworkerset.sigs.k8s.io/role=leader

Next Steps