Dynamic Resource Allocation for GPUs on Amazon EKS
TL;DR: Dynamic GPU Scheduling with DRA on EKS
Dynamic Resource Allocation (DRA) is the next-generation GPU scheduling approach in Kubernetes, providing advanced GPU management capabilities beyond traditional device plugins. Here's what matters:
DRA Advantages over Traditional GPU Scheduling
- Fine-grained resource control: Request specific GPU memory amounts, not just whole devices
- Per-workload sharing strategies: Choose mps, time-slicing, mig, or exclusive per pod, not cluster-wide
- Topology-aware scheduling: Understands NVLink, IMEX, and GPU interconnects for multi-GPU workloads
- Advanced GPU features: Required for Amazon EC2 P6e-GB200 UltraServers IMEX, Multi-Node NVLink, and next-gen GPU capabilities
- Coexistence-friendly: Can run alongside traditional device plugins during transition
Key Implementation Details:
- EKS v1.33: DRA feature gates enabled in EKS-optimized configurations
- For detailed DRA implementation, see the Kubernetes DRA documentation
- Node provisioning compatibility:
  - Managed Node Groups: Full DRA support
  - Self-Managed Node Groups: DRA support (requires manual configuration)
  - Karpenter: DRA support in development (Issue #1231)
- Coexistence: Traditional device plugin and DRA can run simultaneously
Why Managed/Self-Managed Node Groups vs Karpenter for DRA?
- Managed/Self-Managed Node Groups: Full DRA support, optimized for Capacity Block Reservations
- Karpenter: DRA support in development; dynamic scaling conflicts with reserved GPU capacity
- EKS-optimized AMIs: Come with pre-installed NVIDIA drivers
Can I Use Both Traditional GPU Allocation and DRA Together?
- Coexistence supported: Both can run simultaneously on the same cluster
- DRA is the future: NVIDIA and Kubernetes are moving exclusively to DRA
- Migration strategy: Use DRA for new workloads, traditional allocation for existing production
Production Readiness
- Technology Preview: GPU allocation and sharing features actively developed by NVIDIA
- Production Ready: ComputeDomains for Multi-Node NVLink fully supported
- Scheduling overhead: Additional latency due to the claim resolution process
- General Availability: Expected in Kubernetes v1.34 (2025)
- Latest status updates: Follow the NVIDIA DRA Driver GitHub for current development progress
For comprehensive guidance on AI/ML workloads on EKS, see the AWS EKS Best Practices for AI/ML Compute.
Enterprise GPU Utilization Crisis
Despite high demand, enterprise AI platforms consistently waste over half their GPU resources due to scheduling limitations. This represents millions in infrastructure costs.
Even in high-demand AI clusters, GPU utilization frequently remains below 40%. This isn't a configuration issue; it's a fundamental limitation of how Kubernetes abstracts GPU resources. Organizations are paying premium prices for GPU instances while letting the majority of compute power sit idle.
The GPU Scheduling Challenge in Kubernetes
Current State: Traditional GPU Allocation
Kubernetes has rapidly evolved into the de facto standard for orchestrating AI/ML workloads across enterprise environments, with Amazon EKS emerging as the leading platform for managing GPU-accelerated infrastructure at scale. Organizations are running everything from small inference services to massive distributed training jobs on EKS clusters, leveraging GPU instances like P4d, P5, and the latest P6 series to power their machine learning pipelines.
However, despite Kubernetes' sophistication in managing containerized workloads, the traditional GPU scheduling model remains surprisingly primitive and creates significant operational challenges. The current approach treats GPUs as simple, atomic resources that can only be allocated in whole units, fundamentally mismatched with the diverse and evolving needs of modern AI workloads.
How Traditional GPU Scheduling Works:
- Pods request GPUs using simple integer values: nvidia.com/gpu: 1
- Scheduler treats GPUs as opaque, indivisible resources
- Each workload gets exclusive access to entire GPU devices
- No awareness of actual resource requirements or GPU topology
The Problem with This Approach: Modern AI workloads have diverse requirements that don't fit this binary model:
- Small inference jobs need only 2-4GB GPU memory but get allocated entire 80GB A100s
- Large training jobs require coordinated multi-GPU communication via NVLink or IMEX
- Mixed workloads could share GPUs efficiently but are forced into separate devices
The GPU Utilization Crisis
Even in high-demand clusters, GPU utilization frequently remains below 40%. This isn't a configuration issue: it's a fundamental limitation of how Kubernetes abstracts GPU resources.
Common symptoms of inefficient GPU allocation:
- Queue starvation - Small inference jobs wait behind long-running training tasks
- Resource fragmentation - GPU memory is stranded in unusable chunks across nodes
- Topology blindness - Multi-GPU jobs get suboptimal placement, degrading NVLink performance
- Cost explosion - Organizations overprovision GPUs to work around scheduling inefficiencies
Enter Dynamic Resource Allocation (DRA)
What DRA Changes
Dynamic Resource Allocation fundamentally transforms GPU scheduling in Kubernetes from a rigid, device-centric model to a flexible, workload-aware approach:
Traditional Approach:
resources:
  limits:
    nvidia.com/gpu: 1 # Get entire GPU, no customization
DRA Approach:
resourceClaims:
- name: gpu-claim
  resourceClaimTemplateName: gpu-template # Detailed, workload-aware requirements
See examples section below for ResourceClaimTemplate configurations.
Critical: ResourceClaims must exist in the same namespace as the Pods that reference them. Cross-namespace resource claims are not supported.
Key DRA Innovations
Fine-grained Resource Control
- Request specific GPU memory amounts (e.g., 16Gi out of 80Gi available)
- Specify compute requirements independent of memory needs
- Define topology constraints for multi-GPU workloads
Note: ResourceClaims and Pods must be in the same namespace
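As a rough illustration of fine-grained requests, the sketch below asks for any NVIDIA GPU that advertises at least 40Gi of memory. The template name, namespace, and the exact capacity key are assumptions for illustration; verify the attribute and capacity names against the ResourceSlices your NVIDIA DRA driver version actually publishes.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-40gi-template          # hypothetical name
  namespace: my-workloads          # must match the namespace of the Pods that reference it
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Assumed capacity key; check `kubectl get resourceslices -o yaml` for the real names
            expression: |
              device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0
```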
Per-Workload Sharing Strategies
MPS - Concurrent small workloads with memory isolation
Time-slicing - Workloads with different peak usage patterns
MIG - Hardware-level isolation in multi-tenant environments
Exclusive - Performance-critical training jobs
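The strategy itself is selected per claim through the NVIDIA driver's opaque GpuConfig parameters. The time-slicing and MPS templates later in this guide differ only in the strategy value, roughly like this fragment (names reused from those examples):

```yaml
config:
- requests: ["shared-gpu"]
  opaque:
    driver: gpu.nvidia.com
    parameters:
      apiVersion: resource.nvidia.com/v1beta1
      kind: GpuConfig
      sharing:
        strategy: TimeSlicing   # or MPS; omit the config block for exclusive access
```

MIG, by contrast, is requested through the mig.nvidia.com DeviceClass with a profile selector rather than a sharing strategy, as shown in the MIG examples below.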
Topology-Aware Scheduling
- Understands NVLink connections between GPUs
- Leverages IMEX for Amazon EC2 P6e-GB200 UltraServer clusters
- Optimizes placement for distributed training workloads
Future-Proof Architecture
- Required for next-generation systems like Amazon EC2 P6e-GB200 UltraServers
- Enables advanced features like Multi-Node NVLink
- Supports emerging GPU architectures and sharing technologies
Understanding IMEX, ComputeDomains, and Amazon EC2 P6e-GB200 Multi-Node Scheduling
IMEX (NVIDIA Internode Memory Exchange/Management Service) is NVIDIA's orchestration service for GPU memory sharing across NVLink multi-node deployments. In Amazon EC2 P6e-GB200 UltraServer configurations, IMEX coordinates memory export and import operations between nodes, enabling direct GPU-to-GPU memory access across multiple compute nodes for massive AI model training with billions of parameters.
ComputeDomains represent logical groupings of interconnected GPUs that can communicate efficiently through high-bandwidth connections like NVLink or IMEX. DRA uses ComputeDomains to understand GPU topology and ensure workloads requiring multi-GPU coordination are scheduled on appropriately connected hardware.
Amazon EC2 P6e-GB200 Multi-Node Scheduling leverages DRA's topology awareness to coordinate workloads across multiple superchip nodes. Traditional GPU scheduling cannot understand these complex interconnect relationships, making DRA essential for optimal placement of distributed training jobs on Amazon EC2 P6e-GB200 UltraServer systems where proper GPU topology selection directly impacts training performance.
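As a hedged sketch of how a workload opts into a ComputeDomain with DRA, the template below requests an IMEX channel through the compute-domain.nvidia.com DeviceClass listed later in this guide. The metadata names are hypothetical, and the NVIDIA DRA driver additionally manages a ComputeDomain custom resource whose schema varies by release, so treat this as illustrative and follow NVIDIA's documentation for the full setup.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: imex-channel-template       # hypothetical name
  namespace: distributed-training   # hypothetical namespace
spec:
  spec:
    devices:
      requests:
      - name: imex-channel
        deviceClassName: compute-domain.nvidia.com   # cross-node GPU coordination DeviceClass
```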
For detailed configuration examples and implementation guidance, see the AWS EKS AI/ML Best Practices documentation.
Implementation Considerations for EKS
Now that we understand DRA's capabilities and advanced features like IMEX and ComputeDomains, let's explore the practical considerations for implementing DRA on Amazon EKS. The following sections address key decisions around node provisioning, migration strategies, and EKS-specific configurations that will determine your DRA deployment success.
Managed Node Groups vs Karpenter for P-Series GPU Instances and DRA
The choice between node provisioning methods for DRA isn't just about technical compatibility. It's fundamentally about how GPU capacity is purchased and utilized in enterprise AI workloads. Managed and Self-Managed Node Groups are currently the recommended approach for DRA because they align with the economics and operational patterns of high-end GPU instances.
Here's why: The majority of large GPU instances (P4d (A100), P5 (H100), P6 with B200, and P6e with GB200) are primarily available through AWS Capacity Block Reservations rather than on-demand pricing. When organizations purchase Capacity Blocks, they commit to paying for every second of GPU time until the reservation expires, regardless of whether the GPUs are actively utilized. This creates a fundamental mismatch with Karpenter's core value proposition of dynamic scaling based on workload demand. Spinning nodes down during low-demand periods doesn't save money. It actually wastes the reserved capacity you're already paying for.
Additionally, Karpenter doesn't yet support DRA scheduling (Issue #1231 tracks active development), making it incompatible with production DRA workloads. While Karpenter excels at cost optimization through dynamic scaling for general compute workloads, Capacity Block reservations require an "always-on" utilization strategy to maximize ROI: exactly what Managed Node Groups provide with their static capacity model.
The future picture is more optimistic: Karpenter's roadmap includes static node features that would make it suitable for Capacity Block scenarios. The community is actively working on manual node provisioning without workloads and static provisioning capabilities through RFCs like static provisioning and manual node provisioning. Once DRA support is added alongside these static provisioning capabilities, Karpenter could become the preferred choice for DRA workloads with Capacity Block ML reserved instances. Until then, Managed Node Groups with EKS-optimized AMIs (which come with pre-installed NVIDIA drivers) provide the most reliable foundation for DRA implementations.
DRA and Traditional GPU Allocation Coexistence
Yes, but with careful configuration to avoid conflicts. DRA and traditional GPU allocation can coexist on the same cluster, but this requires thoughtful setup to prevent resource double-allocation issues. NVIDIA's DRA driver is designed as an additional component alongside the GPU Operator, with selective enablement to avoid conflicts.
The recommended approach for gradual migration: Configure the NVIDIA DRA driver to enable only specific subsystems initially. For example, you can set resources.gpus.enabled=false to use traditional device plugins for GPU allocation while enabling DRA's ComputeDomain subsystem for Multi-Node NVLink capabilities. This allows teams to gain operational experience with DRA's advanced features without risking established GPU allocation workflows.
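A minimal sketch of that selective enablement as Helm values for the NVIDIA DRA driver chart is shown below. The resources.gpus.enabled key is the one referenced above; the computeDomains key name is an assumption and should be checked against the chart version you deploy.

```yaml
# values-dra-driver.yaml (illustrative override for a gradual rollout)
resources:
  gpus:
    enabled: false       # leave GPU allocation to the traditional device plugin for now
  computeDomains:
    enabled: true        # assumed key: adopt DRA only for Multi-Node NVLink ComputeDomains
```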
Key considerations for coexistence:
- Avoid same-device conflicts: DRA and device plugins should not manage the same GPU devices simultaneously
- Selective component enablement: Use NVIDIA DRA driver's modular design to enable features gradually
- Node selector management: Configure node selectors carefully to prevent resource allocation conflicts
- Technology Preview status: GPU allocation and sharing features are in Technology Preview (check NVIDIA DRA Driver GitHub for updates)
For migration planning, start with DRA's production-ready features like ComputeDomains for Multi-Node NVLink, while keeping traditional device plugins for core GPU allocation. Once DRA's GPU allocation reaches full support, gradually migrate workloads starting with development and inference services before moving mission-critical training jobs. NVIDIA and the Kubernetes community have designed DRA as the eventual replacement for device plugins, but the transition requires careful orchestration to maintain cluster stability.
Visual Comparison: Traditional vs DRA
The comparison below illustrates how DRA fundamentally changes the scheduling flow:
- Traditional Model: The pod directly requests an entire GPU via the node resource model. Scheduling and allocation are static, with no room for partial usage or workload intent.
- DRA Model: Pods express intent via templates; claims are dynamically generated and resolved with the help of a DRA-aware scheduler and device driver. Multiple workloads can share GPUs safely and efficiently, maximizing utilization.
Technical Capabilities Comparison
| Capability | Traditional device plugin | DRA |
| --- | --- | --- |
| Resource request | nvidia.com/gpu: 1 | ResourceClaimTemplate with device selectors |
| Granularity | Whole GPUs only | Fine-grained (memory, compute, topology constraints) |
| Sharing strategy | Cluster-wide configuration | Per-workload (MPS, time-slicing, MIG, exclusive) |
| Topology awareness | None | NVLink/IMEX-aware placement |
How DRA Actually Works: The Complete Technical Flow
Dynamic Resource Allocation (DRA) extends Kubernetes scheduling with a modular, pluggable mechanism for handling GPU and other device resources. Rather than allocating integer units of opaque hardware, DRA introduces ResourceClaims, ResourceClaimTemplates, DeviceClasses, and ResourceSlices to express, match, and provision device requirements at runtime.
Step-by-step DRA Workflow
DRA fundamentally changes how Kubernetes manages GPU resources through sophisticated orchestration:
1. Resource Discovery and Advertisement
When NVIDIA DRA driver starts, it discovers available GPUs on each node and creates ResourceSlices that advertise device capabilities to the Kubernetes API server.
2. DeviceClass Registration
The driver registers one or more DeviceClass objects to logically group GPU resources:
- gpu.nvidia.com: Standard GPU resources
- mig.nvidia.com: Multi-Instance GPU partitions
- compute-domain.nvidia.com: Cross-node GPU coordination
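For reference, a DeviceClass is a small cluster-scoped object that selects devices by driver. The driver installs the real definitions for you; a simplified sketch looks roughly like this:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      # Match every device published by the NVIDIA GPU kubelet plugin
      expression: |
        device.driver == 'gpu.nvidia.com'
```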
3. Resource Claim Creation
ResourceClaimTemplates generate individual ResourceClaims for each pod, specifying:
- Specific GPU memory requirements
- Sharing strategy (MPS, time-slicing, exclusive)
- Driver versions and compute capabilities
- Topology constraints for multi-GPU workloads
4. Intelligent Scheduling
The DRA-aware scheduler evaluates pending ResourceClaims and queries available ResourceSlices across nodes:
- Matches device properties and constraints using CEL expressions
- Ensures sharing strategy compatibility with other running pods
- Selects optimal nodes considering topology, availability, and policy
5. Dynamic Allocation
On the selected node, the DRA driver:
- Sets up device access for the container (e.g., mounts MIG instance or configures MPS)
- Allocates shared vs. exclusive access as per claim configuration
- Isolates GPU slices securely between concurrent workloads
Deploying the Solution
In this example, you will provision the JARK cluster on Amazon EKS with DRA support. The walkthrough has three stages:
- Prerequisites: Install required tools and dependencies
- Deploy: Configure and run the JARK stack installation
- Verify: Test your DRA deployment and validate functionality
Prerequisites
Ensure that you have installed the following tools on your machine:
- AWS CLI - AWS Command Line Interface
- kubectl - Kubernetes command-line tool
- terraform - Infrastructure as Code tool
Deploy
1. Clone the repository:
git clone https://github.com/awslabs/ai-on-eks.git
If you are using a named profile for authentication, set the AWS_PROFILE environment variable to the desired profile name: export AWS_PROFILE="<PROFILE_name>"
2. Review and customize configurations:
- Check available addons in infra/base/terraform/variables.tf
- Modify addon settings in infra/jark-stack/terraform/blueprint.tfvars as needed
- Update the AWS region in blueprint.tfvars
Enable DRA Components:
In the blueprint.tfvars file, uncomment the following lines:
enable_nvidia_dra_driver = true
enable_nvidia_gpu_operator = true
The NVIDIA GPU Operator includes all necessary components:
- NVIDIA Device Plugin
- DCGM Exporter
- MIG Manager
- GPU Feature Discovery
- Node Feature Discovery
The NVIDIA DRA Driver is deployed as a separate Helm chart parallel to the GPU Operator.
3. Navigate to the deployment directory and run the install script:
cd ai-on-eks/infra/jark-stack && chmod +x install.sh
./install.sh
This script will automatically provision and configure the following components:
- Amazon EKS Cluster with DRA (Dynamic Resource Allocation) feature gates enabled.
- Two GPU-managed node groups using Amazon Linux 2023 GPU AMIs:
- G6 Node Group: Intended for testing MPS and time-slicing strategies.
- P4d(e) Node Group: Intended for testing MIG-based GPU partitioning.
Note: Both node groups are initialized with zero nodes to avoid unnecessary cost.
- To test MPS/time-slicing, manually update the g6 node group's min_size and desired_size via the EKS console.
- To test MIG, you need at least one p4d or p4de instance, which requires a Capacity Block Reservation (CBR). Edit infra/base/terraform/eks.tf, set your actual capacity_reservation_id, and change the min_size for the MIG node group to 1.
4. Verify Deployment
Follow these verification steps to ensure your DRA deployment is working correctly:
Step 1: Configure kubectl access
Update your local kubeconfig to access the Kubernetes cluster:
aws eks update-kubeconfig --name jark-stack # Replace with your EKS cluster name
Step 2: Verify worker nodes
First, let's verify that worker nodes are running in the cluster:
kubectl get nodes
Expected output: You should see two x86 instances from the core node group, plus any GPU instances (g6, p4d, etc.) that you manually scaled up via the EKS console.
Step 3: Verify DRA components
Run this command to verify all deployments, including the NVIDIA GPU Operator and NVIDIA DRA Driver:
kubectl get deployments -A
Expected output: All pods should be in Running state before proceeding to test the examples below.
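Beyond the deployments, you can confirm that the DRA API objects themselves exist; these are the same objects the troubleshooting section later queries:

```bash
# DeviceClasses registered by the NVIDIA DRA driver (for example gpu.nvidia.com, mig.nvidia.com)
kubectl get deviceclasses

# ResourceSlices appear only for GPU nodes that have actually been scaled up
kubectl get resourceslices
```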
Instance compatibility for testing:
- Time-slicing and MPS: Any G5 or G6 instance
- MIG partitioning: P-series instances (P4d or higher)
- IMEX use cases: P6e-GB200 UltraServers
Once all components are running, you can start testing the various DRA examples mentioned in the following sections.
Component Architecture
The NVIDIA DRA Driver runs as an independent Helm chart parallel to the NVIDIA GPU Operator, not as part of it. Both components work together to provide comprehensive GPU management capabilities.
GPU Sharing Strategies: Technical Deep Dive
Understanding GPU sharing technologies is crucial for optimizing resource utilization. Each strategy provides different benefits and addresses specific use cases.
- Basic Allocation
- Time-Slicing
- Multi-Process Service (MPS)
- Multi-Instance GPU (MIG)
Basic GPU Allocation
Standard GPU allocation without sharing - each workload gets exclusive access to a complete GPU. This is the traditional model that provides maximum performance isolation.
How to Deploy Basic Allocation:
- ResourceClaimTemplate
- Basic Pod
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: gpu-pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu0
  resourceClaims:
  - name: gpu0
    resourceClaimTemplateName: single-gpu
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
Deploy the Example:
kubectl apply -f basic-gpu-claim-template.yaml
kubectl apply -f basic-gpu-pod.yaml
kubectl get pods -n gpu-test1 -w
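Once the pod is Running, a quick way to confirm the claim was satisfied is to read the container log, since the example's entrypoint runs nvidia-smi -L:

```bash
# Should list exactly one GPU (an NVIDIA L4 on a G6 node)
kubectl logs -n gpu-test1 gpu-pod -c ctr0
```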
Best For:
- Large model training requiring full GPU resources
- Workloads that fully utilize GPU compute and memory
- Applications requiring maximum performance isolation
- Legacy applications not designed for GPU sharing
What is Time-Slicing?
Time-slicing is a GPU sharing mechanism where multiple workloads take turns using the GPU, with each getting exclusive access during their allocated time slice. This approach is similar to CPU time-sharing but applied to GPU resources.
Technical Implementation:
- The GPU scheduler allocates specific time windows (typically 1-10ms) to each workload
- During a workload's time slice, it has complete access to GPU compute and memory
- Context switching occurs between time slices, saving and restoring GPU state
- No memory isolation between workloads - they share the same GPU memory space
Key Characteristics:
- Temporal Isolation: Workloads are isolated in time but not in memory
- Full GPU Access: Each workload gets complete GPU resources during its slice
- Context Switching Overhead: Small performance penalty for switching between workloads
- Flexible Allocation: Time slice duration can be adjusted based on workload requirements
How to Deploy Time-Slicing with DRA
- ResourceClaimTemplate
- Pod Configuration
apiVersion: v1
kind: Namespace
metadata:
  name: timeslicing-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: timeslicing-gpu-template
  namespace: timeslicing-gpu
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing
# ConfigMap containing Python scripts for timeslicing pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: timeslicing-scripts-configmap
  namespace: timeslicing-gpu
data:
  inference-script.py: |
    import torch
    import time
    import os

    print(f"=== POD 1 STARTING ===")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Simulate inference workload
        for i in range(20):
            x = torch.randn(1000, 1000).cuda()
            y = torch.mm(x, x.t())
            print(f"Pod 1 - Iteration {i+1} completed at {time.strftime('%H:%M:%S')}")
            time.sleep(5)
    else:
        print("No GPU available!")
        time.sleep(60)
  training-script.py: |
    import torch
    import time
    import os

    print(f"=== POD 2 STARTING ===")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Simulate training workload with heavier compute
        for i in range(15):
            x = torch.randn(2000, 2000).cuda()
            y = torch.mm(x, x.t())
            loss = torch.sum(y)
            print(f"Pod 2 - Training step {i+1}, Loss: {loss.item():.2f} at {time.strftime('%H:%M:%S')}")
            time.sleep(5)
    else:
        print("No GPU available!")
        time.sleep(60)
---
# Pod 1 - Inference workload
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-1
  namespace: timeslicing-gpu
  labels:
    app: gpu-inference
spec:
  restartPolicy: Never
  containers:
  - name: inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim
  resourceClaims:
  - name: shared-gpu-claim
    resourceClaimTemplateName: timeslicing-gpu-template
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: timeslicing-scripts-configmap
      defaultMode: 0755
---
# Pod 2 - Training workload
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-2
  namespace: timeslicing-gpu
  labels:
    app: gpu-training
spec:
  restartPolicy: Never
  containers:
  - name: training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim-2
  resourceClaims:
  - name: shared-gpu-claim-2
    resourceClaimTemplateName: timeslicing-gpu-template
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: timeslicing-scripts-configmap
      defaultMode: 0755
Deploy the Example:
kubectl apply -f timeslicing-claim-template.yaml
kubectl apply -f timeslicing-pod.yaml
kubectl get pods -n timeslicing-gpu -w
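To watch time-slicing in action, tail both pods' logs; with this configuration both workloads should report the same underlying GPU while their iterations interleave:

```bash
kubectl logs -n timeslicing-gpu inference-pod-1 -f   # Pod 1 iterations
kubectl logs -n timeslicing-gpu training-pod-2 -f    # Pod 2 training steps on the same GPU
```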
Best For:
- Inference workloads with sporadic GPU usage
- Development and testing environments
- Workloads with different peak usage times
- Applications that don't require memory isolation
No memory or fault isolation between workloads. One workload can affect others through memory exhaustion or GPU errors.
What is MPS?
NVIDIA Multi-Process Service (MPS) is a GPU sharing technology that allows multiple CUDA applications to run concurrently on the same GPU by creating a daemon that manages GPU access and enables spatial sharing of GPU resources.
Technical Implementation:
- MPS daemon acts as a proxy between CUDA applications and the GPU driver
- Each process gets dedicated GPU memory allocation
- Compute kernels from different processes can execute simultaneously when resources allow
- Memory isolation is maintained between processes
- Hardware scheduling enables true parallel execution
Key Characteristics:
- Spatial Isolation: GPU compute units can be shared simultaneously
- Memory Isolation: Each process has dedicated memory space
- Concurrent Execution: Multiple kernels can run in parallel
- Lower Latency: Reduced context switching compared to time-slicing
How to Deploy MPS with DRA
- ResourceClaimTemplate
- Multi-Container Pod
apiVersion: v1
kind: Namespace
metadata:
  name: mps-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mps-gpu-template
  namespace: mps-gpu
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
# ConfigMap containing Python scripts for MPS pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-scripts-configmap
  namespace: mps-gpu
data:
  inference-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== INFERENCE CONTAINER STARTING ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Create inference model
        model = nn.Sequential(
            nn.Linear(1000, 500),
            nn.ReLU(),
            nn.Linear(500, 100)
        ).cuda()

        # Run inference
        for i in range(1, 999999):
            with torch.no_grad():
                x = torch.randn(128, 1000).cuda()
                output = model(x)
                result = torch.sum(output)
            print(f"Inference Container PID {os.getpid()}: Batch {i}, Result: {result.item():.2f} at {time.strftime('%H:%M:%S')}")
            time.sleep(2)
    else:
        print("No GPU available!")
        time.sleep(60)
  training-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== TRAINING CONTAINER STARTING ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Create training model
        model = nn.Sequential(
            nn.Linear(2000, 1000),
            nn.ReLU(),
            nn.Linear(1000, 500),
            nn.ReLU(),
            nn.Linear(500, 10)
        ).cuda()

        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

        # Run training
        for epoch in range(1, 999999):
            x = torch.randn(64, 2000).cuda()
            target = torch.randn(64, 10).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            print(f"Training Container PID {os.getpid()}: Epoch {epoch}, Loss: {loss.item():.4f} at {time.strftime('%H:%M:%S')}")
            time.sleep(3)
    else:
        print("No GPU available!")
        time.sleep(60)
---
# Single Pod with Multiple Containers sharing GPU via MPS
apiVersion: v1
kind: Pod
metadata:
  name: mps-multi-container-pod
  namespace: mps-gpu
  labels:
    app: mps-demo
spec:
  restartPolicy: Never
  containers:
  # Container 1 - Inference workload
  - name: inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim
        request: shared-gpu
  # Container 2 - Training workload
  - name: training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim
        request: shared-gpu
  resourceClaims:
  - name: shared-gpu-claim
    resourceClaimTemplateName: mps-gpu-template
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mps-scripts-configmap
      defaultMode: 0755
Deploy the Example:
kubectl apply -f mps-claim-template.yaml
kubectl apply -f mps-pod.yaml
kubectl get pods -n mps-gpu -w
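To verify MPS sharing, check each container's log and then the process list on the shared GPU; both Python processes should be visible, typically alongside the MPS control daemon (exact nvidia-smi output depends on driver and MPS versions):

```bash
kubectl logs -n mps-gpu mps-multi-container-pod -c inference-container --tail=5
kubectl logs -n mps-gpu mps-multi-container-pod -c training-container --tail=5

# Process view on the shared GPU
kubectl exec -n mps-gpu mps-multi-container-pod -c inference-container -- nvidia-smi
```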
Best For:
- Multiple small inference workloads
- Concurrent model serving scenarios
- Workloads using less than 50% of GPU compute
- Applications requiring memory isolation
MPS eliminates context switching overhead and enables true parallelism. Ideal for workloads using less than 50% of GPU compute capacity.
What is MIG?
Multi-Instance GPU (MIG) is a hardware-level GPU partitioning technology available on NVIDIA A100, H100, and newer GPUs that creates smaller, isolated GPU instances with dedicated compute units, memory, and memory bandwidth.
Technical Implementation:
- Hardware-level partitioning creates separate GPU instances
- Each MIG instance has dedicated streaming multiprocessors (SMs)
- Memory and memory bandwidth are physically partitioned
- Complete fault isolation between instances
- Independent scheduling and execution contexts
Key Characteristics:
- Hardware Isolation: Physical separation of compute and memory resources
- Fault Isolation: Issues in one instance don't affect others
- Predictable Performance: Guaranteed resources for each instance
- Fixed Partitioning: Predefined MIG profiles (1g.5gb, 2g.10gb, etc.)
How to Deploy MIG with DRA
- ResourceClaimTemplate
- MIG Pod
apiVersion: v1
kind: Namespace
metadata:
  name: mig-gpu
---
# Template for 3g.40gb MIG instance (Large training)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-large-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-large
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '3g.40gb'
---
# Template for 2g.20gb MIG instance (Medium training)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-medium-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-medium
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '2g.20gb'
---
# Template for 1g.10gb MIG instance (Small inference)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-small-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-small
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '1g.10gb'
# ConfigMap containing Python scripts for MIG pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-scripts-configmap
  namespace: mig-gpu
data:
  large-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== LARGE TRAINING POD (3g.40gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Large model for 3g.40gb instance
        model = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        optimizer = optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Training loop
        for epoch in range(100):
            # Large batch for 3g.40gb
            x = torch.randn(256, 2048).cuda()
            y = torch.randint(0, 10, (256,)).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f"Large Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(3)

        print("Large training completed on 3g.40gb MIG instance")
  medium-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== MEDIUM TRAINING POD (2g.20gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Medium model for 2g.20gb instance
        model = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        optimizer = optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Training loop
        for epoch in range(100):
            # Medium batch for 2g.20gb
            x = torch.randn(128, 1024).cuda()
            y = torch.randint(0, 10, (128,)).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f"Medium Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(4)

        print("Medium training completed on 2g.20gb MIG instance")
  small-inference-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== SMALL INFERENCE POD (1g.10gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Small model for 1g.10gb instance
        model = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Inference loop
        for i in range(200):
            with torch.no_grad():
                # Small batch for 1g.10gb
                x = torch.randn(32, 512).cuda()
                output = model(x)
                prediction = torch.argmax(output, dim=1)

            if i % 20 == 0:
                print(f"Small Inference - Batch {i}, Predictions: {prediction[:5].tolist()}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(2)

        print("Small inference completed on 1g.10gb MIG instance")
---
# Pod 1: Large training workload (3g.40gb)
apiVersion: v1
kind: Pod
metadata:
  name: mig-large-training-pod
  namespace: mig-gpu
  labels:
    app: mig-large-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: large-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/large-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-large-claim
  resourceClaims:
  - name: mig-large-claim
    resourceClaimTemplateName: mig-large-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
---
# Pod 2: Medium training workload (2g.20gb) - can run on SAME GPU as Pod 1
apiVersion: v1
kind: Pod
metadata:
  name: mig-medium-training-pod
  namespace: mig-gpu
  labels:
    app: mig-medium-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: medium-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/medium-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-medium-claim
  resourceClaims:
  - name: mig-medium-claim
    resourceClaimTemplateName: mig-medium-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
---
# Pod 3: Small inference workload (1g.10gb) - can run on SAME GPU as Pod 1 & 2
apiVersion: v1
kind: Pod
metadata:
  name: mig-small-inference-pod
  namespace: mig-gpu
  labels:
    app: mig-small-inference
    workload-type: inference
spec:
  restartPolicy: Never
  containers:
  - name: small-inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/small-inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-small-claim
  resourceClaims:
  - name: mig-small-claim
    resourceClaimTemplateName: mig-small-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
Deploy the Example:
kubectl apply -f mig-claim-template.yaml
kubectl apply -f mig-pod.yaml
kubectl get pods -n mig-gpu -w
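After the pods schedule, you can confirm that each one landed on its own MIG instance rather than a full GPU:

```bash
# Each claim should be allocated against a different MIG profile
kubectl get resourceclaims -n mig-gpu

# Inside a pod, only the assigned MIG device is visible (listed by nvidia-smi -L as a MIG 3g.40gb device)
kubectl exec -n mig-gpu mig-large-training-pod -- nvidia-smi -L
```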
Best For:
- Multi-tenant environments requiring strict isolation
- Predictable performance requirements
- Production workloads requiring guaranteed resources
- Compliance scenarios requiring hardware-level isolation
- Hardware-level partitioning creates isolated GPU instances
- Each MIG instance has dedicated compute units and memory
- Complete fault isolation between instances
- Requires GPU Operator with MIG Manager for dynamic reconfiguration
Strategy Selection Guide
| Workload Type | Recommended Strategy | Key Benefit |
| --- | --- | --- |
| Small Inference Jobs | Time-slicing or MPS | Higher GPU utilization |
| Concurrent Small Models | MPS | True parallelism |
| Production Multi-tenant | MIG | Hardware isolation |
| Large Model Training | Basic Allocation | Maximum performance |
| Development/Testing | Time-slicing | Flexibility and simplicity |
Cleanup
Removing DRA Components
- 1. Clean Up DRA Examples
- 2. JARK Stack Cleanup
Remove all DRA example workloads:
# Delete all pods first to ensure proper cleanup
kubectl delete pod inference-pod-1 -n timeslicing-gpu --ignore-not-found
kubectl delete pod training-pod-2 -n timeslicing-gpu --ignore-not-found
kubectl delete pod mps-multi-container-pod -n mps-gpu --ignore-not-found
kubectl delete pod mig-large-training-pod mig-medium-training-pod mig-small-inference-pod -n mig-gpu --ignore-not-found
kubectl delete pod gpu-pod -n gpu-test1 --ignore-not-found
# Delete ResourceClaimTemplates
kubectl delete resourceclaimtemplate timeslicing-gpu-template -n timeslicing-gpu --ignore-not-found
kubectl delete resourceclaimtemplate mps-gpu-template -n mps-gpu --ignore-not-found
kubectl delete resourceclaimtemplate mig-large-template mig-medium-template mig-small-template -n mig-gpu --ignore-not-found
kubectl delete resourceclaimtemplate single-gpu -n gpu-test1 --ignore-not-found
# Delete any remaining ResourceClaims
kubectl delete resourceclaims --all --all-namespaces --ignore-not-found
# Delete ConfigMaps (contain scripts)
kubectl delete configmap timeslicing-scripts-configmap -n timeslicing-gpu --ignore-not-found
kubectl delete configmap mps-scripts-configmap -n mps-gpu --ignore-not-found
kubectl delete configmap mig-scripts-configmap -n mig-gpu --ignore-not-found
# Finally delete namespaces
kubectl delete namespace timeslicing-gpu --ignore-not-found
kubectl delete namespace mps-gpu --ignore-not-found
kubectl delete namespace mig-gpu --ignore-not-found
kubectl delete namespace gpu-test1 --ignore-not-found
# Verify cleanup
kubectl get resourceclaims --all-namespaces
kubectl get resourceclaimtemplates --all-namespaces
For JARK-deployed clusters, use the automated cleanup:
# Navigate to JARK directory
cd ai-on-eks/infra/jark-stack/terraform/_LOCAL
# Run the cleanup script
chmod +x cleanup.sh
./cleanup.sh
# Alternative: Manual terraform destroy
# terraform destroy -var-file=terraform/blueprint.tfvars -auto-approve
This will remove the entire EKS cluster and all associated resources. Ensure you have backed up any important data before proceeding.
Troubleshooting Common Issues
- Pods Stuck in Pending
- GPU Sharing Conflicts
- Performance Issues
Issue: Pods with ResourceClaims stuck in Pending state
Diagnosis:
# Check ResourceClaim status
kubectl get resourceclaims --all-namespaces -o wide
# Check DRA driver logs
kubectl logs -n gpu-operator -l app=nvidia-dra-driver --tail=100
# Verify DeviceClasses exist
kubectl get deviceclasses
Resolution:
# Restart DRA driver pods
kubectl delete pods -n gpu-operator -l app=nvidia-dra-driver
# Check node GPU availability
kubectl describe nodes | grep -A 10 "Allocatable"
Issue: Incompatible sharing strategies on same GPU
Diagnosis:
# Check ResourceSlice allocation
kubectl get resourceslices -o yaml
# Verify current allocations
kubectl get resourceclaims --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.status.allocation.deviceResults[*].device}{"\n"}{end}'
Resolution:
# Remove conflicting ResourceClaims
kubectl delete resourceclaim <conflicting-claim> -n <namespace>
# Wait for resource cleanup
kubectl wait --for=delete resourceclaim <claim-name> -n <namespace> --timeout=60s
Issue: Suboptimal GPU utilization or performance
Monitoring:
# Check GPU utilization
kubectl exec -it <gpu-pod> -n <namespace> -- nvidia-smi
# Monitor ResourceClaim allocation
kubectl get events --field-selector reason=ResourceClaimAllocated --sort-by='.lastTimestamp'
# Check sharing strategy effectiveness
kubectl logs <workload-pod> -n <namespace> | grep -i gpu
Optimization:
- Review sharing strategy selection (MPS vs time-slicing vs exclusive)
- Validate workload resource requirements match allocation
- Consider MIG partitioning for predictable isolation
Conclusion
Dynamic Resource Allocation represents a fundamental shift from rigid GPU allocation to intelligent, workload-aware resource management. By leveraging structured ResourceClaims and vendor-specific drivers, DRA unlocks the GPU utilization rates necessary for cost-effective AI/ML operations at enterprise scale.
With the simplified JARK-based deployment approach, organizations can implement production-grade DRA capabilities in three steps, transforming their GPU infrastructure from a static resource pool into a dynamic, intelligent platform optimized for modern AI workloads.
The combination of EKS's managed infrastructure, NVIDIA's driver ecosystem, and Kubernetes' declarative model creates a powerful foundation for next-generation AI workloads - from small inference jobs to multi-node distributed training on GB200 superchips.