Use of Llama 4 models is governed by the Meta Llama License. Visit the model page on Hugging Face and accept the license to request access before deploying.
Llama 4 Inference with vLLM on AWS Trainium
This guide covers deploying Llama 4 models using vLLM with optimum-neuron on AWS Trainium instances.
Llama 4 inference on Neuron is supported via optimum-neuron >= 0.4.0 with the Llama4NeuronModelForCausalLM class. However, the first deployment requires Neuron model compilation, which happens automatically when vllm serve runs but can take 30-60+ minutes. Pre-compiled artifacts may not yet be available in the optimum-neuron-cache for all configurations.
The optimum-cli export neuron command does not support Llama 4. Use vllm serve directly, which invokes the inference-path compilation internally.
Why Trainium for Llama 4?
AWS Trainium provides large HBM capacity, making it an excellent choice for large MoE models like Llama 4:
| Instance | Chips | NeuronCores | HBM Memory | Karpenter | EKS Auto Mode |
|---|---|---|---|---|---|
| trn1.32xlarge | 16 Trainium v1 | 32 | 512 GiB | Supported | Supported |
| trn2.48xlarge | 16 Trainium v2 | 64 | 1.5 TiB | Supported | Not yet supported |
| Advantage | Detail |
|---|---|
| No quantization needed | Both trn1 (512 GiB) and trn2 (1.5 TiB) support Scout (~220 GiB) in native BF16 |
| Karpenter auto-provisioning | Neuron NodePool provisions Trainium nodes on-demand when workloads are scheduled |
| trn2 for Maverick | trn2.48xlarge (1.5 TiB) supports Maverick (~800 GiB) in BF16 without quantization |
Memory Requirements
| Model | BF16 Memory | trn1.32xlarge (512 GiB) | trn2.48xlarge (1.5 TiB) |
|---|---|---|---|
| Scout 17B-16E | ~220 GiB | Fits in BF16 | Fits in BF16 |
| Maverick 17B-128E | ~800 GiB | Does not fit | Fits in BF16 |
For Maverick, only trn2.48xlarge has sufficient memory (1.5 TiB) for BF16. trn1.32xlarge (512 GiB) is insufficient.
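As a rough cross-check of the table above, BF16 weight memory can be estimated from total parameter counts: all experts count toward memory even though only ~17B parameters are active per token. The totals below use the published figures for Scout (~109B) and Maverick (~400B); serving adds KV cache and runtime overhead on top of the raw weights.

```python
def bf16_weight_gib(total_params: float) -> float:
    """Approximate weight footprint in GiB at BF16 (2 bytes per parameter)."""
    return total_params * 2 / 2**30

# Total parameters, including all experts (only ~17B are active per token):
print(f"Scout 17B-16E:     ~{bf16_weight_gib(109e9):.0f} GiB")  # ~203 GiB of weights
print(f"Maverick 17B-128E: ~{bf16_weight_gib(400e9):.0f} GiB")  # ~745 GiB of weights
```

The ~220 GiB and ~800 GiB figures in the table above include runtime overhead beyond the raw weights.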
Trainium instance availability varies by region. Check the AWS EC2 Instance Types by Region page for current availability before deploying your infrastructure.
- trn2.48xlarge: Not supported by EKS Auto Mode — use Karpenter with the inference-ready cluster.
Model Compilation
The AWS Neuron DLC uses optimum-neuron to run vLLM on Trainium. Models must be pre-compiled for Neuron before serving. The DLC checks the optimum-neuron-cache on Hugging Face for pre-compiled model artifacts matching your configuration (model, batch size, sequence length, tensor parallelism, dtype).
The optimum-cli export neuron command does not support llama4 as a model type. However, vllm serve uses a separate inference code path (optimum.neuron.models.inference.llama4) that includes full MoE support via Llama4NeuronModelForCausalLM. Compilation is triggered automatically on first serve.
Software Versions
| Component | Version | Notes |
|---|---|---|
| Neuron SDK | 2.26.1 | Required |
| optimum-neuron | >= 0.4.0 | Llama 4 inference support added in v0.4.0 |
| vLLM | 0.11.0 | With optimum-neuron Neuron platform plugin |
| neuronx-distributed | 0.15 | MoE module used by Llama 4 inference |
| DLC Image | 763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-vllm-inference-neuronx:0.11.0-optimum0.4.5-neuronx-py310-sdk2.26.1-ubuntu22.04 | Latest available |
Deploying the Inference-Ready EKS Cluster
Deploy Llama 4 Scout on Trainium
Step 1: Create Hugging Face Token Secret
kubectl create secret generic hf-token --from-literal=token=<your-huggingface-token>
Step 2: Deploy with Helm
For trn2.48xlarge (Scout):
helm repo add ai-on-eks https://awslabs.github.io/ai-on-eks-charts/
helm repo update
helm install llama4-scout-neuron ai-on-eks/inference-charts \
--values https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/refs/heads/main/charts/inference-charts/values-llama-4-scout-17b-vllm-neuron.yaml
Key deployment parameters:
- `tensor_parallel_size: 16` (one per Trainium chip, not per NeuronCore)
- Docker image: AWS Neuron DLC from private ECR (`763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-vllm-inference-neuronx`)
- Neuron device requests: `aws.amazon.com/neuron: 16` for all 16 chips
- CPU memory: `384Gi` minimum (weight sharding requires loading the full model into CPU memory)
- Instance type: `trn2.48xlarge` (default for both Scout and Maverick)
- Environment variable: `VLLM_NEURON_FRAMEWORK=optimum` is required for on-the-fly Neuron compilation
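These parameters map onto the pod spec roughly as follows. This is an illustrative excerpt under the assumptions above, not the chart's exact rendered manifest:

```yaml
resources:
  limits:
    aws.amazon.com/neuron: 16   # request all 16 Trainium chips on the node
  requests:
    memory: 384Gi               # full model is loaded into CPU memory before sharding
env:
  - name: VLLM_NEURON_FRAMEWORK
    value: "optimum"            # enable on-the-fly Neuron compilation via optimum-neuron
```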
Step 3: Monitor Deployment
After deploying, Karpenter will automatically provision a Trainium node:
# Watch node provisioning
kubectl get nodeclaims -w
# Check pod status
kubectl get pods -w
During deployment, the pod will go through these stages:
- Pending - waiting for Trainium node provisioning (~5 minutes)
- ContainerCreating - pulling the Neuron DLC image (~2.9 GiB)
- Running - Neuron model compilation on first run (~15 minutes in the tested timeline below; allow 30-60+ minutes for uncached configurations)
- Ready - vLLM server is serving requests
The pod requires at least 384 GiB of CPU memory for model weight sharding across 16 Neuron devices. With insufficient memory (e.g., 64 GiB), the pod will be OOMKilled during weight loading. The trn2.48xlarge instance provides ~2 TiB of system memory, so this is well within capacity.
The first deployment takes significantly longer due to Neuron model compilation. Subsequent deployments with the same configuration will use cached artifacts. Monitor the compilation progress in the logs:
kubectl logs -f -l app.kubernetes.io/instance=llama4-scout-neuron
Tested deployment timeline on trn2.48xlarge (Scout):
| Phase | Duration | Description |
|---|---|---|
| Node provisioning | ~5 min | Karpenter provisions trn2.48xlarge |
| Image pull | ~30 sec | DLC image (~2.9 GiB, cached after first pull) |
| HLO generation | ~60 sec | Generates HLOs for context_encoding and token_generation |
| Neuron compilation | ~200 sec | neuronx-cc compiles HLOs to NEFFs (target=trn2) |
| Model build | ~650 sec | Weight layout transformation |
| Weight loading | ~5 min | Download, shard, and load weights to 16 Neuron devices |
| Total (first deploy) | ~20 min | Subsequent deploys reuse cached compilation artifacts |
Once complete, the vLLM server will start:
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
Deploy Llama 4 Maverick on Trainium2
Maverick requires trn2.48xlarge (1.5 TiB HBM) and runs in native BF16 without quantization. Ensure your cluster has the trn2-neuron Karpenter NodePool configured (see cluster setup above).
No manual model compilation is needed. Like Scout, vllm serve automatically triggers JIT compilation via optimum-neuron on first startup. Ensure the pod has sufficient startup time configured (liveness/readiness probe initialDelaySeconds) to allow compilation to complete without Kubernetes restarting the pod.
helm install llama4-maverick-neuron ai-on-eks/inference-charts \
--values https://raw.githubusercontent.com/awslabs/ai-on-eks-charts/refs/heads/main/charts/inference-charts/values-llama-4-maverick-17b-vllm-neuron.yaml
- trn2.48xlarge availability is limited. Check AWS EC2 Instance Types by Region before deploying.
- Ensure your AWS account has sufficient service quota for Trainium instances (Maverick requires 192 vCPUs).
Persisting the Compilation Cache
By default, Neuron compilation artifacts are stored in ephemeral container storage (/var/tmp/neuron-compile-cache/). This means recompilation will occur on every pod restart, adding ~20 minutes of startup time. For production deployments, persist the cache using one of these approaches:
Option 1: S3-Backed Cache (Recommended)
Set the NEURON_COMPILE_CACHE_URL environment variable to store compiled artifacts in S3:
env:
- name: NEURON_COMPILE_CACHE_URL
value: "s3://your-bucket/neuron-compile-cache/"
This allows all pods (including replacements and scale-out replicas) to share the same compilation cache.
Option 2: PersistentVolume Mount
Mount a PersistentVolume to the compilation cache directory:
volumeMounts:
- name: neuron-cache
mountPath: /var/tmp/neuron-compile-cache
volumes:
- name: neuron-cache
persistentVolumeClaim:
claimName: neuron-compile-cache-pvc
The optimum-neuron-cache on Hugging Face is checked automatically before local compilation. If pre-compiled artifacts for your exact configuration (model, batch size, sequence length, tensor parallelism, dtype) are available, they will be downloaded instead of recompiled. As Llama 4 configurations are added to the cache, cold-start times will improve.
Test the Model
Port Forward
kubectl port-forward svc/llama4-scout-neuron 8000:8000
Chat Completion Request
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [
{"role": "user", "content": "Explain the benefits of Mixture of Experts architecture in large language models."}
],
"max_tokens": 512,
"temperature": 0.7
}'
List Available Models
curl http://localhost:8000/v1/models | python3 -m json.tool
Multimodal Request (Text + Image)
Llama 4 supports multimodal inference. Send image URLs alongside text:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [
{
"role": "user",
"content": [
{"type": "text", "text": "Describe what you see in this image."},
{"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}}
]
}
],
"max_tokens": 256
}'
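The JSON bodies above can also be built and sent from Python. This is a minimal sketch using only the standard library; the endpoint URL assumes the port-forward from the earlier step is active, and the helper names are ours, not part of vLLM.

```python
import json
from urllib import request

MODEL = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

def chat_payload(text, image_url=None, max_tokens=256):
    """Build an OpenAI-compatible chat completion body, optionally multimodal."""
    if image_url is None:
        content = text
    else:
        content = [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]
    return {"model": MODEL,
            "messages": [{"role": "user", "content": content}],
            "max_tokens": max_tokens}

def send(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload to the vLLM server (requires the port-forward)."""
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (run with the server reachable):
# print(send(chat_payload("Describe this image.",
#                         image_url="https://example.com/cat.jpg")))
```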
Deploy Open WebUI
Open WebUI provides a ChatGPT-style interface for interacting with the model.
helm repo add open-webui https://helm.openwebui.com/
helm repo update
helm install open-webui open-webui/open-webui \
--namespace open-webui --create-namespace \
--set ollama.enabled=false \
--set env.OPENAI_API_BASE_URL=http://llama4-scout-neuron.default.svc.cluster.local:8000/v1 \
--set env.OPENAI_API_KEY=dummy
Access the UI:
kubectl port-forward svc/open-webui 8080:80 -n open-webui
Open http://localhost:8080 in your browser and register a new account. The model will appear in the model selector.
Monitoring
Check Inference Logs
# View vLLM Neuron logs
kubectl logs -l app.kubernetes.io/instance=llama4-scout-neuron --tail=100
# Monitor token generation throughput
kubectl logs -l app.kubernetes.io/instance=llama4-scout-neuron -f | grep "tokens/s"
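To track throughput over time rather than eyeballing grep output, the log lines can be parsed programmatically. The regex below assumes vLLM's periodic stats lines contain a number followed by `tokens/s`; the exact wording varies by vLLM version, so verify it against your logs.

```python
import re

# Assumed shape of vLLM's periodic throughput line -- adjust to your version.
THROUGHPUT = re.compile(r"([\d.]+)\s*tokens/s")

def tokens_per_second(log_lines):
    """Extract token-generation throughput values from vLLM log lines."""
    return [float(m.group(1)) for line in log_lines
            if (m := THROUGHPUT.search(line))]

sample = [
    "INFO 01-01 00:00:10 metrics.py: Avg generation throughput: 41.7 tokens/s",
    "INFO 01-01 00:00:20 metrics.py: Avg generation throughput: 43.2 tokens/s",
]
print(tokens_per_second(sample))  # [41.7, 43.2]
```

Piping `kubectl logs -f` through a script like this gives a simple live throughput view without a full observability stack.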
Observability Dashboard
If the observability stack is enabled on your cluster, access Grafana:
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
Cleanup
Remove the model deployment:
# Remove Scout
helm uninstall llama4-scout-neuron
# Remove Maverick (if deployed)
helm uninstall llama4-maverick-neuron
To destroy the entire cluster infrastructure:
cd ai-on-eks/infra/solutions/inference-ready-cluster
./cleanup.sh