NVIDIA NIM Operator on Amazon EKS
What is NVIDIA NIM?
NVIDIA NIM (NVIDIA Inference Microservices) is a set of containerized microservices that make it easier to deploy and host large language models (LLMs) and other AI models in your own environment. NIM provides standard APIs (similar to OpenAI or other AI services) so developers can build applications like chatbots and AI assistants while leveraging NVIDIA GPU acceleration for high-performance inference. In essence, NIM abstracts away the complexities of model runtime and optimization, offering a fast path to inference with optimized backends (such as TensorRT-LLM and vLLM) under the hood.
NVIDIA NIM Operator for Kubernetes
The NVIDIA NIM Operator is a Kubernetes operator that automates the deployment, scaling, and management of NVIDIA NIM microservices on a Kubernetes cluster.
Instead of manually pulling containers, provisioning GPU nodes, or writing YAML for every model, the NIM Operator introduces three primary Custom Resource Definitions (CRDs): NIMCache, NIMService, and NIMPipeline.
These CRDs allow you to declaratively define model deployments using native Kubernetes syntax.
The Operator handles:
- Pulling the model image from NVIDIA GPU Cloud (NGC)
- Caching model weights and optimized runtime profiles
- Launching model-serving pods with GPU allocation
- Exposing inference endpoints via Kubernetes Services
- Integrating with autoscaling (e.g., HPA + Karpenter)
- Chaining multiple models together into inference pipelines using NIMPipeline
NIMCache – Model Caching for Faster Load Times
A NIMCache (nimcaches.apps.nvidia.com) is a custom resource that pre-downloads and stores a model's weights, tokenizer, and runtime-optimized engine files (such as TensorRT-LLM profiles) into a shared persistent volume.
This ensures:
- Faster cold start times: no repeated downloads from NGC
- Storage reuse across nodes and replicas
- Centralized, shared model store (typically on EFS or FSx for Lustre in EKS)
Model profiles are optimized for specific GPUs (e.g., A10G, L4) and precisions (e.g., FP16). When a NIMCache is created, the Operator discovers the available model profiles and selects the best one for your cluster.
📌 Tip: Using NIMCache is highly recommended for production, especially when running multiple replicas or restarting models frequently.
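To make this concrete, here is a minimal sketch of what a NIMCache manifest can look like. The field names follow the NIM Operator v1alpha1 API, but the image tag, storage class, and size below are illustrative assumptions; the manifest shipped with this blueprint (nim-cache-llama3-8b-instruct.yaml) remains the source of truth.

```yaml
# Sketch of a NIMCache resource (illustrative values; see the
# nim-cache-llama3-8b-instruct.yaml shipped with this blueprint).
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      # NIM container used to pull and optimize the model from NGC
      modelPuller: nvcr.io/nim/meta/llama-3.1-8b-instruct:<version>   # replace with a real tag
      pullSecret: ngc-secret        # Docker registry secret (created later in Step 1)
      authSecret: ngc-api-secret    # NGC API key secret (created later in Step 1)
      model:
        engine: tensorrt_llm        # cache TensorRT-LLM engine profiles
        tensorParallelism: "2"      # match the TP setting used by the NIMService
  storage:
    pvc:
      create: true
      storageClass: efs-sc          # assumed EFS storage class name
      size: 100Gi
      volumeAccessMode: ReadWriteMany
```

Once applied, the Operator runs a caching job that downloads the selected profile into the PVC; the cached profiles then appear in the NIMCache status (see Step 2 of the deployment below).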
NIMService – Deploying and Managing the Model Server
A NIMService (nimservices.apps.nvidia.com) represents a running instance of a NIM model server on your cluster. It specifies the container image, GPU resources, number of replicas, and optionally the name of a NIMCache.
Key benefits:
- Declarative model deployment using Kubernetes YAML
- Automatic node scheduling for GPU nodes
- Shared cache support using NIMCache
- Autoscaling via HPA or external triggers
- ClusterIP or Ingress support to expose APIs
For example, deploying the Meta Llama 3.1 8B Instruct model involves creating:
- A NIMCache to store the model (optional but recommended)
- A NIMService pointing to the cached model and allocating GPUs
If NIMCache is not used, the model will be downloaded each time a pod starts, which may increase startup latency.
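For illustration, a NIMService that consumes the cache above might look roughly like the following sketch. The field names follow the NIM Operator v1alpha1 API; the repository tag, GPU count, and port are assumptions, and the blueprint's nim-service-llama3-8b-instruct.yaml is authoritative.

```yaml
# Sketch of a NIMService resource (illustrative values only).
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.1-8b-instruct
    tag: "<version>"                  # replace with a real tag
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct   # reuse the engine files cached by the NIMCache
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 2               # two GPUs for tensor parallelism = 2
  expose:
    service:
      type: ClusterIP
      port: 8000                      # OpenAI-compatible API port
```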
NIMPipeline
NIMPipeline (nimpipelines.apps.nvidia.com) is another CRD that can group multiple NIMService resources into an ordered inference pipeline. This is useful for multi-model workflows like:
- Retrieval-Augmented Generation (RAG)
- Embeddings + LLM chaining
- Preprocessing + classification pipelines
In this tutorial, we focus on a single-model deployment using NIMCache and NIMService.
Overview of this deployment pattern on Amazon EKS
This deployment blueprint demonstrates how to run the Meta Llama 3.1 8B Instruct model on Amazon EKS using the NVIDIA NIM Operator with multi-GPU support and optimized model caching for fast startup times.
The model is served using:
- G5 instances (g5.12xlarge): These instances come with 4 NVIDIA A10G GPUs
- Tensor Parallelism (TP): Set to 2, meaning the model will run in parallel across 2 GPUs
- Persistent Shared Cache: Backed by Amazon EFS to speed up model startup by reusing previously generated engine files
By combining these components, the model is deployed as a scalable Kubernetes workload that supports:
- Efficient GPU scheduling with Karpenter
- Fast model load using the NIMCache
- Scalable serving endpoint via NIMService
📌 Note: You can modify the tensorParallelism setting or select a different instance type (e.g., G6 with L4 GPUs) based on your performance and cost requirements.
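The GPU capacity itself is provisioned by Karpenter. The blueprint creates the NodePools via Terraform, so you do not need to apply anything yourself; the following is only an illustrative sketch of what a g5 NodePool can look like (the name, capacity type, taint, and limits are assumptions, not the blueprint's exact configuration).

```yaml
# Illustrative sketch of a Karpenter NodePool for g5 GPU instances.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: g5-gpu-karpenter
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: g5-gpu-karpenter
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.12xlarge"]      # widen this list to allow other G5 sizes
      taints:
        - key: nvidia.com/gpu          # keep non-GPU workloads off these nodes
          effect: NoSchedule
  limits:
    nvidia.com/gpu: "8"                # cap total GPUs this pool can provision
```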
Note: Before implementing NVIDIA NIM, be aware that it is part of NVIDIA AI Enterprise, which may introduce licensing costs for production use.
For evaluation, NVIDIA offers a free 90-day trial license for NVIDIA AI Enterprise, which you can register for with your corporate email.
Deploying the Solution
In this tutorial, the entire AWS infrastructure is provisioned using Terraform, including:
- Amazon VPC with public and private subnets
- Amazon EKS cluster
- GPU nodepools using Karpenter
- Addons such as:
- NVIDIA device plugin
- EFS CSI driver
- NVIDIA NIM Operator
As a demonstration, the Meta Llama 3.1 8B Instruct model will be deployed using a NIMService, optionally backed by a NIMCache for improved cold start performance.
Prerequisites
Before getting started with NVIDIA NIM, ensure you have the following:
Click to expand the NVIDIA NIM account setup details
NVIDIA AI Enterprise Account
- Register for an NVIDIA AI Enterprise account. If you don't have one, you can sign up for a trial account using this link.
NGC API Key
- Log in to your NVIDIA AI Enterprise account
- Navigate to the NGC (NVIDIA GPU Cloud) portal
- Generate a personal API key:
  - Go to your account settings or navigate directly to: https://org.ngc.nvidia.com/setup/personal-keys
  - Click on "Generate Personal Key"
  - Ensure that at least "NGC Catalog" is selected from the "Services Included" dropdown
  - Copy and securely store your API key; the key should start with the prefix nvapi-
Validate NGC API Key and Test Image Pull
To ensure your API key is valid and working correctly:
- Set up your NGC API key as an environment variable:
export NGC_API_KEY=<your_api_key_here>
- Authenticate Docker with the NVIDIA Container Registry:
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
- Test pulling an image from NGC:
docker pull nvcr.io/nim/meta/llama3-8b-instruct:latest
You do not have to wait for the pull to complete; this step simply confirms that your API key can pull images from NGC.
The following tools are also required to run this tutorial: the AWS CLI, kubectl, and Terraform.
Deploy
Clone the ai-on-eks repository that contains the Terraform code for this deployment pattern:
git clone https://github.com/awslabs/ai-on-eks.git
Navigate to the NVIDIA NIM deployment directory and run the install script to deploy the infrastructure:
cd ai-on-eks/infra/nvidia-nim
./install.sh
This deployment will take approximately 20 minutes to complete.
Once the installation finishes, you will find the configure_kubectl command in the Terraform output. Run the following to configure access to the EKS cluster:
# Creates k8s config file to authenticate with EKS
aws eks --region us-west-2 update-kubeconfig --name nvidia-nim-eks
Verify the deployments - Click to expand the deployment details
$ kubectl get all -n nim-operator
NAME READY STATUS RESTARTS AGE
pod/nim-operator-k8s-nim-operator-6fdffdf97f-56fxc 1/1 Running 0 26h
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/k8s-nim-operator-metrics-service ClusterIP 172.20.148.6 <none> 8080/TCP 26h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/nim-operator-k8s-nim-operator 1/1 1 1 26h
NAME DESIRED CURRENT READY AGE
replicaset.apps/nim-operator-k8s-nim-operator-6fdffdf97f 1 1 1 26h
$ kubectl get crds | grep nim
nimcaches.apps.nvidia.com 2025-03-27T17:39:00Z
nimpipelines.apps.nvidia.com 2025-03-27T17:39:00Z
nimservices.apps.nvidia.com 2025-03-27T17:39:01Z
$ kubectl get crds | grep nemo
nemocustomizers.apps.nvidia.com 2025-03-27T17:38:59Z
nemodatastores.apps.nvidia.com 2025-03-27T17:38:59Z
nemoentitystores.apps.nvidia.com 2025-03-27T17:38:59Z
nemoevaluators.apps.nvidia.com 2025-03-27T17:39:00Z
nemoguardrails.apps.nvidia.com 2025-03-27T17:39:00Z
To list the Karpenter autoscaling NodePools:
$ kubectl get nodepools
NAME NODECLASS NODES READY AGE
g5-gpu-karpenter g5-gpu-karpenter 1 True 47h
g6-gpu-karpenter g6-gpu-karpenter 0 True 7h56m
inferentia-inf2 inferentia-inf2 0 False 47h
trainium-trn1 trainium-trn1 0 False 47h
x86-cpu-karpenter x86-cpu-karpenter 0 True 47h
Deploy llama-3.1-8b-instruct with NIM Operator
Step 1: Create Secrets for Authentication
To access the NVIDIA container registry and model artifacts, you'll need to provide your NGC API key. This script creates two Kubernetes secrets: ngc-secret for Docker image pulls and ngc-api-secret for model authorization.
cd blueprints/inference/gpu/nvidia-nim-operator-llama3-8b
NGC_API_KEY="your-real-ngc-key" ./deploy-nim-auth.sh
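For reference, the two secrets created by the script are roughly equivalent to the manifests below. The nim-service namespace and the NGC_API_KEY key name are assumptions based on typical NIM Operator setups; inspect deploy-nim-auth.sh for the exact definitions.

```yaml
# Rough equivalent of the secrets created by deploy-nim-auth.sh
# (namespace and key names are assumptions; the script is authoritative).
apiVersion: v1
kind: Secret
metadata:
  name: ngc-secret               # used by pods to pull images from nvcr.io
  namespace: nim-service
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {"auths":{"nvcr.io":{"username":"$oauthtoken","password":"<NGC_API_KEY>"}}}
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-api-secret           # used by NIM containers to authorize model downloads
  namespace: nim-service
type: Opaque
stringData:
  NGC_API_KEY: "<NGC_API_KEY>"
```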
Step 2: Cache the Model to EFS using NIMCache CRD
The NIMCache custom resource will pull the model and cache optimized engine profiles to EFS. This dramatically reduces startup time when launching the model later via NIMService.
cd blueprints/inference/gpu/nvidia-nim-operator-llama3-8b
kubectl apply -f nim-cache-llama3-8b-instruct.yaml
Check status:
kubectl get nimcaches.apps.nvidia.com -n nim-service
Expected output:
NAME STATUS PVC AGE
meta-llama3-8b-instruct Ready meta-llama3-8b-instruct-pvc 21h
Display cached model profiles:
kubectl get nimcaches.apps.nvidia.com -n nim-service \
meta-llama3-8b-instruct -o=jsonpath="{.status.profiles}" | jq .
Sample output:
[
{
"config": {
"feat_lora": "false",
"gpu": "A10G",
"llm_engine": "tensorrt_llm",
"precision": "fp16",
"profile": "throughput",
"tp": "2"
}
}
]
Step 3: Deploy the Model using NIMService CRD
Now launch the model service using the cached engine profiles.
cd blueprints/inference/gpu/nvidia-nim-operator-llama3-8b
kubectl apply -f nim-service-llama3-8b-instruct.yaml
Check the deployed resources:
kubectl get all -n nim-service
Expected output:
NAME READY STATUS RESTARTS AGE
pod/meta-llama3-8b-instruct-6cdf47d6f6-hlbnf 1/1 Running 0 6h35m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/meta-llama3-8b-instruct ClusterIP 172.20.85.8 <none> 8000/TCP 6h35m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/meta-llama3-8b-instruct 1/1 1 1 6h35m
NAME DESIRED CURRENT READY AGE
replicaset.apps/meta-llama3-8b-instruct-6cdf47d6f6 1 1 1 6h35m
🚀 Model Startup Timeline
The following sample is captured from the pod/meta-llama3-8b-instruct-6cdf47d6f6-hlbnf log:
Step | Timestamp | Description |
---|---|---|
Start | ~20:00:50 | Pod starts, NIM container logs begin |
Profile Match | 20:00:50.100 | Detects and selects cached profile (tp=2) |
Workspace Ready | 20:00:50.132 | Model workspace initialized via EFS in 0.126s |
TensorRT Init | 20:00:51.168 | TensorRT-LLM engine begins setup |
Engine Ready | 20:01:06 | Engine loaded and profiles activated (~16.6 GiB across 2 GPUs) |
API Server Ready | 20:02:11.036 | FastAPI + Uvicorn starts |
Health Check OK | 20:02:18.781 | /v1/health/ready endpoint returns 200 OK |
⚡ Startup time (cold boot to ready): ~81 seconds thanks to cached engine on EFS.
Test the Model with a Prompt
Step 1: Port Forward the Model Service
Expose the model locally using port forwarding:
kubectl port-forward -n nim-service service/meta-llama3-8b-instruct 8001:8000
Step 2: Send a Sample Prompt Using curl
Run the following command to test the model with a chat prompt:
curl -X POST \
http://localhost:8001/v1/chat/completions \
-H 'Accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [
{
"role": "user",
"content": "What should I do for a 4 day vacation at Cape Hatteras National Seashore?"
}
],
"top_p": 1,
"n": 1,
"max_tokens": 1024,
"stream": false,
"frequency_penalty": 0.0,
"stop": ["STOP"]
}'
Sample Response (Shortened):
{"id":"chat-061a9dba9179437fa24cab7f7c767f19","object":"chat.completion","created":1743215809,"model":"meta/llama-3.1-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Cape Hatteras National Seashore is a beautiful coastal destination with a rich history, pristine beaches,
...
exploration of the area's natural beauty and history. Feel free to modify it to suit your interests and preferences. Safe travels!"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":30,"total_tokens":773,"completion_tokens":743},"prompt_logprobs":null}
🧠 The model is now running with Tensor Parallelism = 2 across two A10G GPUs, each utilizing approximately 21.4 GiB of memory. Thanks to NIMCache backed by EFS, the model loaded quickly and is ready for low-latency inference.
Open WebUI Deployment
Open WebUI is compatible only with models served through an OpenAI-compatible API or Ollama.
1. Deploy the WebUI
Deploy the Open WebUI by running the following command:
kubectl apply -f ai-on-eks/blueprints/inference/gpu/nvidia-nim-operator-llama3-8b/openai-webui-deployment.yaml
2. Port Forward to Access WebUI
Use kubectl port-forward to access the WebUI locally:
kubectl port-forward svc/open-webui 8081:80 -n openai-webui
3. Access the WebUI
Open your browser and go to http://localhost:8081
4. Sign Up
Sign up using your name, email, and a dummy password.
5. Start a New Chat
Click on New Chat and select the model from the dropdown menu, as shown in the screenshot below:
6. Enter Test Prompt
Enter your prompt, and you will see the streaming results, as shown below:
Performance Testing with NVIDIA GenAI-Perf Tool
GenAI-Perf is a command-line tool for measuring the throughput and latency of generative AI models served through an inference server.
GenAI-Perf can be used as a standard tool to benchmark other models deployed behind an inference server, but it requires a GPU. To make this easier, we provide a pre-configured manifest, genaiperf-deploy.yaml, to run the tool.
cd ai-on-eks/blueprints/inference/gpu/nvidia-nim-operator-llama3-8b
kubectl apply -f genaiperf-deploy.yaml
Once the pod is in Running status with 1/1 ready, you can exec into the pod.
export POD_NAME=$(kubectl get po -l app=genai-perf -ojsonpath='{.items[0].metadata.name}')
kubectl exec -it $POD_NAME -- bash
Run the benchmark against the deployed NIM Llama 3.1 model:
genai-perf profile -m meta/llama-3.1-8b-instruct \
--url meta-llama3-8b-instruct.nim-service:8000 \
--service-kind openai \
--endpoint-type chat \
--num-prompts 100 \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--concurrency 20 \
--streaming \
--tokenizer hf-internal-testing/llama-tokenizer
You should see output similar to the following, showing the metrics that GenAI-Perf collects, including request latency, output token throughput, and request throughput.
To understand the command line options, please refer to this documentation.
Grafana Dashboard
NVIDIA provides a Grafana dashboard to better visualize NIM status. The dashboard contains several important metrics:
- Time to First Token (TTFT): The latency between the initial inference request to the model and the return of the first token.
- Inter-Token Latency (ITL): The latency between each token after the first.
- Total Throughput: The total number of tokens generated per second by the NIM.
You can also monitor additional metrics such as KV cache utilization; see this document for a full list of metric descriptions.
To view the Grafana dashboard to monitor these metrics, follow the steps below:
Click to expand details
1. Retrieve the Grafana password.
The password is saved in AWS Secrets Manager. The following Terraform command will show you the secret name.
terraform output grafana_secret_name
Then use the secret name from the output to run the following command:
aws secretsmanager get-secret-value --secret-id <grafana_secret_name_output> --region $AWS_REGION --query "SecretString" --output text
2. Expose the Grafana Service
Use port-forward to expose the Grafana service.
kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n monitoring
3. Login to Grafana:
- Open your web browser and navigate to http://localhost:3000.
- Log in with the username admin and the password retrieved from AWS Secrets Manager.
4. Open the NIM Monitoring Dashboard:
- Once logged in, click "Dashboards" on the left sidebar and search "nim"
- Find the NVIDIA NIM Monitoring dashboard in the list
- Click it to open the dashboard
You should now see the metrics displayed on the Grafana dashboard, allowing you to monitor the performance of your NVIDIA NIM service deployment.
At the time of writing, NVIDIA also provides an example Grafana dashboard. You can check it here.
Conclusion
This blueprint showcases how to deploy and scale large language models like Meta’s Llama 3.1 8B Instruct efficiently on Amazon EKS using the NVIDIA NIM Operator.
By combining OpenAI-compatible APIs with GPU-accelerated inference, declarative Kubernetes CRDs (NIMCache, NIMService), and fast model startup via EFS-based caching, you get a streamlined, production-grade model deployment experience.
Key Benefits:
- Faster cold starts through shared, persistent model cache
- Declarative and repeatable deployments with CRDs
- Dynamic GPU autoscaling powered by Karpenter
- One-click infrastructure provisioning using Terraform
In just ~20 minutes, you can go from zero to a scalable LLM service on Kubernetes — ready to serve real-world prompts with low latency and high efficiency.
Cleanup
To tear down the deployed model and associated infrastructure:
Step 1: Delete Model Resources
Delete the deployed NIMService and NIMCache objects from your cluster:
cd blueprints/inference/gpu/nvidia-nim-operator-llama3-8b
kubectl delete -f nim-service-llama3-8b-instruct.yaml
kubectl delete -f nim-cache-llama3-8b-instruct.yaml
Verify deletion:
kubectl get nimservices.apps.nvidia.com -n nim-service
kubectl get nimcaches.apps.nvidia.com -n nim-service
Step 2: Destroy AWS Infrastructure
Navigate back to the root Terraform module and run the cleanup script. This will destroy all AWS resources created for this blueprint, including the VPC, EKS cluster, EFS, and node groups:
cd ai-on-eks/infra/nvidia-nim
./cleanup.sh