
Dynamic Resource Allocation for GPUs on Amazon EKS

πŸš€ TL;DR – Dynamic GPU Scheduling with DRA on EKS

Dynamic Resource Allocation (DRA) is the next-generation GPU scheduling approach in Kubernetes, providing advanced GPU management capabilities beyond traditional device plugins. Here's what matters:

DRA Advantages over Traditional GPU Scheduling​

  • 🎯 Fine-grained resource control – Request specific GPU memory amounts, not just whole devices
  • πŸ”„ Per-workload sharing strategies – Choose mps, time-slicing, mig, or exclusive per pod, not cluster-wide
  • 🧠 Topology-aware scheduling – Understands NVLink, IMEX, and GPU interconnects for multi-GPU workloads
  • ⚑ Advanced GPU features – Required for Amazon EC2 P6e-GB200 UltraServers IMEX, Multi-Node NVLink, and next-gen GPU capabilities
  • 🀝 Coexistence-friendly – Can run alongside traditional device plugins during transition

Amazon EC2 P6e-GB200 UltraServer Requirement
  • Traditional scheduling unsupported – Amazon EC2 P6e-GB200 UltraServers require DRA and won't work with NVIDIA device plugin + kube-scheduler
  • DRA mandatory – Multi-Node NVLink and IMEX capabilities only available through DRA

Key Implementation Details:

  • ☸️ EKS Control Plane – v1.33+ (DRA feature gates enabled)
  • πŸ–₯️ EKS Optimized NVIDIA AMI – Latest AMI (pre-installed drivers)
  • πŸ”— Managed Node Groups – Full DRA support (recommended approach)
  • πŸ”§ Self-Managed Node Groups – DRA support (manual configuration)
  • πŸ› οΈ NVIDIA GPU Operator – v25.3.0+ (required for DRA)
  • ⚑ NVIDIA DRA Driver – v25.3.0+ (core DRA functionality)
  • 🚧 Karpenter DRA Support – In development (GitHub Issue #1231)
  • πŸ”¬ DRA Status – Beta (K8s v1.32+), Technology Preview
  • EKS v1.33 – DRA feature gates enabled in EKS-optimized configurations
  • For detailed DRA implementation – See Kubernetes DRA documentation
  • Node provisioning compatibility:
    • Managed Node Groups – Full DRA support 🎯
    • Self-Managed Node Groups – DRA support (requires manual configuration) πŸ”§
    • Karpenter – DRA support in development (Issue #1231) πŸ—οΈ
  • Coexistence – Traditional device plugin and DRA can run simultaneously

Why Managed/Self-Managed Node Groups vs Karpenter for DRA?​

  • Managed/Self-Managed Node Groups – Full DRA support, optimized for Capacity Block Reservations
  • Karpenter – DRA support in development, dynamic scaling conflicts with reserved GPU capacity
  • EKS-optimized AMIs – Come with pre-installed NVIDIA drivers

Can I Use Both Traditional GPU Allocation and DRA Together?​

  • Coexistence supported – Both can run simultaneously on the same cluster
  • DRA is the future – NVIDIA and Kubernetes moving exclusively to DRA
  • Migration strategy – Use DRA for new workloads, traditional for existing production

Production Readiness​

  • Technology Preview – GPU allocation and sharing features actively developed by NVIDIA
  • Production Ready – ComputeDomains for Multi-Node NVLink fully supported
  • Scheduling overhead – Additional latency due to claim resolution process
  • General Availability – Expected in Kubernetes v1.34 (2025)
  • Latest status updates – Follow NVIDIA DRA Driver GitHub for current development progress
Additional Resources

For comprehensive guidance on AI/ML workloads on EKS, see the AWS EKS Best Practices for AI/ML Compute.

πŸ’Έ

Enterprise GPU Utilization Crisis

60% of GPU capacity wasted

Despite high demand, enterprise AI platforms consistently waste over half their GPU resources due to scheduling limitations. This represents millions in infrastructure costs.

Even in high-demand AI clusters, GPU utilization frequently remains below 40%. This isn't a configuration issue β€” it's a fundamental limitation of how Kubernetes abstracts GPU resources. Organizations are paying premium prices for GPU instances while letting the majority of compute power sit idle.

πŸŽ›οΈ

The GPU Scheduling Challenge in Kubernetes​

Current State: Traditional GPU Allocation​

Kubernetes has rapidly evolved into the de facto standard for orchestrating AI/ML workloads across enterprise environments, with Amazon EKS emerging as the leading platform for managing GPU-accelerated infrastructure at scale. Organizations are running everything from small inference services to massive distributed training jobs on EKS clusters, leveraging GPU instances like P4d, P5, and the latest P6 series to power their machine learning pipelines.

However, despite Kubernetes' sophistication in managing containerized workloads, the traditional GPU scheduling model remains surprisingly primitive and creates significant operational challenges. The current approach treats GPUs as simple, atomic resources that can only be allocated in whole units, fundamentally mismatched with the diverse and evolving needs of modern AI workloads.

How Traditional GPU Scheduling Works:

  • Pods request GPUs using simple integer values: nvidia.com/gpu: 1
  • Scheduler treats GPUs as opaque, indivisible resources
  • Each workload gets exclusive access to entire GPU devices
  • No awareness of actual resource requirements or GPU topology

The Problem with This Approach: Modern AI workloads have diverse requirements that don't fit this binary model:

  • Small inference jobs need only 2-4GB GPU memory but get allocated entire 80GB A100s
  • Large training jobs require coordinated multi-GPU communication via NVLink or IMEX
  • Mixed workloads could share GPUs efficiently but are forced into separate devices

The GPU Utilization Crisis​

Critical Inefficiency in Production

Even in high-demand clusters, GPU utilization frequently remains below 40%. This isn't a configuration issue: it's a fundamental limitation of how Kubernetes abstracts GPU resources.

Common symptoms of inefficient GPU allocation:

  • Queue starvation - Small inference jobs wait behind long-running training tasks
  • Resource fragmentation - GPU memory is stranded in unusable chunks across nodes
  • Topology blindness - Multi-GPU jobs get suboptimal placement, degrading NVLink performance
  • Cost explosion - Organizations overprovision GPUs to work around scheduling inefficiencies
πŸ’Ž

Enter Dynamic Resource Allocation (DRA)​

What DRA Changes​

Dynamic Resource Allocation fundamentally transforms GPU scheduling in Kubernetes from a rigid, device-centric model to a flexible, workload-aware approach:

Traditional Approach:

resources:
  limits:
    nvidia.com/gpu: 1  # Get entire GPU, no customization

DRA Approach:

resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-template  # Detailed requirements

See examples section below for ResourceClaimTemplate configurations.

Namespace Requirement

Critical: ResourceClaims must exist in the same namespace as the Pods that reference them. Cross-namespace resource claims are not supported.

Key DRA Innovations​

🎯

Fine-grained Resource Control

  • Request specific GPU memory amounts (e.g., 16Gi out of 80Gi available)
  • Specify compute requirements independent of memory needs
  • Define topology constraints for multi-GPU workloads

Note: ResourceClaims and Pods must be in the same namespace
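
To make this concrete, here is a minimal sketch of a ResourceClaimTemplate that narrows its device request with a CEL selector on GPU memory. The capacity name (memory) and the quantity comparison are assumptions about what the NVIDIA DRA driver advertises in its ResourceSlices, so verify them against your cluster before relying on this:

memory-selector-claim-template.yaml (illustrative)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: large-memory-gpu        # create in the same namespace as the consuming pod
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Only match GPUs that advertise at least 40Gi of memory
            # (capacity name assumed; inspect your ResourceSlices to confirm)
            expression: |-
              device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0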

πŸ”„

Per-Workload Sharing Strategies

MPS - Concurrent small workloads with memory isolation

Time-slicing - Workloads with different peak usage patterns

MIG - Hardware-level isolation in multi-tenant environments

Exclusive - Performance-critical training jobs
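
Because the sharing strategy travels with the claim instead of a cluster-wide ConfigMap, two templates in the same cluster can request different behavior. A minimal time-slicing sketch follows, reusing the timeslicing-gpu namespace and template name referenced in the Cleanup section; the GpuConfig apiVersion, kind, and field names follow NVIDIA's published DRA driver examples and should be verified against the driver release you deploy:

timeslicing-claim-template.yaml (illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: timeslicing-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: timeslicing-gpu
  name: timeslicing-gpu-template
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            # Driver-specific sharing configuration
            apiVersion: resource.nvidia.com/v1beta1   # assumed; check NVIDIA DRA driver docs
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing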

🌐

Topology-Aware Scheduling

πŸš€

Future-Proof Architecture

Understanding IMEX, ComputeDomains, and Amazon EC2 P6e-GB200 Multi-Node Scheduling​

IMEX (NVIDIA Internode Memory Exchange/Management Service) is NVIDIA's orchestration service for GPU memory sharing across NVLink multi-node deployments. In Amazon EC2 P6e-GB200 UltraServer configurations, IMEX coordinates memory export and import operations between nodes, enabling direct GPU-to-GPU memory access across multiple compute nodes for massive AI model training with billions of parameters.

ComputeDomains represent logical groupings of interconnected GPUs that can communicate efficiently through high-bandwidth connections like NVLink or IMEX. DRA uses ComputeDomains to understand GPU topology and ensure workloads requiring multi-GPU coordination are scheduled on appropriately connected hardware.

Amazon EC2 P6e-GB200 Multi-Node Scheduling leverages DRA's topology awareness to coordinate workloads across multiple superchip nodes. Traditional GPU scheduling cannot understand these complex interconnect relationships, making DRA essential for optimal placement of distributed training jobs on Amazon EC2 P6e-GB200 UltraServer systems where proper GPU topology selection directly impacts training performance.

For detailed configuration examples and implementation guidance, see the AWS EKS AI/ML Best Practices documentation.
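
For orientation only, NVIDIA's DRA driver models this with a ComputeDomain custom resource that generates a ResourceClaimTemplate for workload pods to join. The sketch below is illustrative: the apiVersion and field names (numNodes, channel) are assumptions to check against the driver release you install:

computedomain-sketch.yaml (illustrative)
apiVersion: resource.nvidia.com/v1beta1    # assumed API group/version
kind: ComputeDomain
metadata:
  name: distributed-training-domain
spec:
  numNodes: 2                              # assumed field: number of participating GB200 nodes
  channel:
    resourceClaimTemplate:
      name: training-domain-channel        # pods reference this generated template to join the domain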

Implementation Considerations for EKS​

Now that we understand DRA's capabilities and advanced features like IMEX and ComputeDomains, let's explore the practical considerations for implementing DRA on Amazon EKS. The following sections address key decisions around node provisioning, migration strategies, and EKS-specific configurations that will determine your DRA deployment success.

Managed Node Groups vs Karpenter for P-Series GPU Instances and DRA​

The choice between node provisioning methods for DRA isn't just about technical compatibility. It's fundamentally about how GPU capacity is purchased and utilized in enterprise AI workloads. Managed and Self-Managed Node Groups are currently the recommended approach for DRA because they align with the economics and operational patterns of high-end GPU instances.

Here's why: The majority of large GPU instances (P4d (A100), P5 (H100), P6 with B200, and P6e with GB200) are primarily available through AWS Capacity Block Reservations rather than on-demand pricing. When organizations purchase Capacity Blocks, they commit to paying for every second of GPU time until the reservation expires, regardless of whether the GPUs are actively utilized. This creates a fundamental mismatch with Karpenter's core value proposition of dynamic scaling based on workload demand. Spinning nodes down during low-demand periods doesn't save money. It actually wastes the reserved capacity you're already paying for.

Additionally, Karpenter doesn't yet support DRA scheduling (Issue #1231 tracks active development), making it incompatible with production DRA workloads. While Karpenter excels at cost optimization through dynamic scaling for general compute workloads, Capacity Block reservations require an "always-on" utilization strategy to maximize ROI: exactly what Managed Node Groups provide with their static capacity model.

The future picture is more optimistic: Karpenter's roadmap includes static node features that would make it suitable for Capacity Block scenarios, and the community is actively working on static provisioning and manual node provisioning (creating nodes without pending workloads) through open RFCs. Once DRA support is added alongside these static provisioning capabilities, Karpenter could become the preferred choice for DRA workloads on Capacity Block reservations. Until then, Managed Node Groups with EKS-optimized AMIs (which come with pre-installed NVIDIA drivers) provide the most reliable foundation for DRA implementations.

DRA and Traditional GPU Allocation Coexistence​

DRA and traditional GPU allocation can coexist on the same cluster, but careful configuration is required to prevent resource double-allocation. NVIDIA's DRA driver is designed as an additional component alongside the GPU Operator, with selective enablement of its subsystems to avoid conflicts.

The recommended approach for gradual migration: Configure the NVIDIA DRA driver to enable only specific subsystems initially. For example, you can set resources.gpus.enabled=false to use traditional device plugins for GPU allocation while enabling DRA's ComputeDomain subsystem for Multi-Node NVLink capabilities. This allows teams to gain operational experience with DRA's advanced features without risking established GPU allocation workflows.
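
A hedged sketch of what that selective enablement could look like as Helm values for the NVIDIA DRA driver chart: resources.gpus.enabled is the flag mentioned above, while the computeDomains key name is an assumption to confirm against the chart's values.yaml for your version:

nvidia-dra-driver-values.yaml (illustrative)
resources:
  gpus:
    enabled: false          # keep GPU allocation on the traditional device plugin for now
  computeDomains:
    enabled: true           # assumed key; enables the ComputeDomain subsystem (Multi-Node NVLink)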

Key considerations for coexistence:

  • Avoid same-device conflicts: DRA and device plugins should not manage the same GPU devices simultaneously
  • Selective component enablement: Use NVIDIA DRA driver's modular design to enable features gradually
  • Node selector management: Configure node selectors carefully to prevent resource allocation conflicts
  • Technology Preview status: GPU allocation and sharing features are in Technology Preview (check NVIDIA DRA Driver GitHub for updates)

For migration planning, start with DRA's production-ready features like ComputeDomains for Multi-Node NVLink, while keeping traditional device plugins for core GPU allocation. Once DRA's GPU allocation reaches full support, gradually migrate workloads starting with development and inference services before moving mission-critical training jobs. NVIDIA and the Kubernetes community have designed DRA as the eventual replacement for device plugins, but the transition requires careful orchestration to maintain cluster stability.

Visual Comparison: Traditional vs DRA​

The diagram below illustrates how DRA fundamentally changes the scheduling flow:

  • Traditional Model: The pod directly requests an entire GPU via the node resource model. Scheduling and allocation are static, with no room for partial usage or workload intent.
  • DRA Model: Pods express intent via templates; claims are dynamically generated and resolved with the help of a DRA-aware scheduler and device driver. Multiple workloads can share GPUs safely and efficiently, maximizing utilization.

Technical Capabilities Comparison​

| Capability | πŸ”΄ Traditional Device Plugin | 🟒 Dynamic Resource Allocation (DRA) |
|---|---|---|
| Resource Request Model | ❌ Simple integers (nvidia.com/gpu: 1) | βœ… Structured claims via ResourceClaimTemplate |
| GPU Memory Specification | ❌ All-or-nothing allocation | βœ… Memory-based constraints and selectors |
| Sharing Configuration | ⚠️ Static cluster-wide ConfigMaps | βœ… Per-workload sharing strategies |
| Multi-GPU Topology Awareness | ❌ No topology coordination | βœ… DeviceClass selectors for NVLink, IMEX |
| Runtime Reconfiguration | ❌ Requires pod deletion and redeployment | βœ… Dynamic reallocation without restarts |
| MIG Support | ⚠️ Limited - static partitions, manual setup | βœ… Full MIG profiles via dynamic claims |

βš™οΈ

How DRA Actually Works: The Complete Technical Flow​

Dynamic Resource Allocation (DRA) extends Kubernetes scheduling with a modular, pluggable mechanism for handling GPU and other device resources. Rather than allocating integer units of opaque hardware, DRA introduces ResourceClaims, ResourceClaimTemplates, DeviceClasses, and ResourceSlices to express, match, and provision device requirements at runtime.

Step-by-step DRA Workflow​

DRA fundamentally changes how Kubernetes manages GPU resources through sophisticated orchestration:

1. Resource Discovery and Advertisement​

When NVIDIA DRA driver starts, it discovers available GPUs on each node and creates ResourceSlices that advertise device capabilities to the Kubernetes API server.

2. DeviceClass Registration​

The driver registers one or more DeviceClass objects to logically group GPU resources:

  • gpu.nvidia.com: Standard GPU resources
  • mig.nvidia.com: Multi-Instance GPU partitions
  • compute-domain.nvidia.com: Cross-node GPU coordination
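
These DeviceClasses are installed by the driver itself; workloads only reference them by name in their claims. Conceptually, a DeviceClass is a named CEL filter over advertised devices, roughly like the sketch below (illustrative; the real objects shipped by the driver may carry additional selectors and configuration):

gpu-deviceclass-sketch.yaml (illustrative)
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      # Match every device advertised by the NVIDIA GPU kubelet plugin
      expression: "device.driver == 'gpu.nvidia.com'"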

3. Resource Claim Creation​

ResourceClaimTemplates generate individual ResourceClaims for each pod, specifying:

  • Specific GPU memory requirements
  • Sharing strategy (MPS, time-slicing, exclusive)
  • Driver versions and compute capabilities
  • Topology constraints for multi-GPU workloads
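
For example, a single template can claim multiple GPUs for one pod. Below is a minimal sketch using the same resource.k8s.io/v1beta1 API as the examples later on this page; ExactCount asks the scheduler for exactly the stated number of devices:

multi-gpu-claim-template.yaml (illustrative)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: dual-gpu                # create in the same namespace as the consuming pod
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.nvidia.com
        allocationMode: ExactCount
        count: 2                # allocate exactly two GPUs to each pod using this template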

4. Intelligent Scheduling​

The DRA-aware scheduler evaluates pending ResourceClaims and queries available ResourceSlices across nodes:

  • Matches device properties and constraints using CEL expressions
  • Ensures sharing strategy compatibility with other running pods
  • Selects optimal nodes considering topology, availability, and policy

5. Dynamic Allocation​

On the selected node, the DRA driver:

  • Sets up device access for the container (e.g., mounts MIG instance or configures MPS)
  • Allocates shared vs. exclusive access as per claim configuration
  • Isolates GPU slices securely between concurrent workloads

Deploying the Solution​

πŸ‘‡ In this example, you will provision the JARK cluster on Amazon EKS with DRA support:

1. Prerequisites – Install required tools and dependencies
2. Deploy – Configure and run the JARK stack installation
3. Verify – Test your DRA deployment and validate functionality

Prerequisites​

Ensure that you have installed the following tools on your machine:

  • AWS CLI - AWS Command Line Interface
  • kubectl - Kubernetes command-line tool
  • terraform - Infrastructure as Code tool

Deploy​

1. Clone the repository:​

Clone the repository
git clone https://github.com/awslabs/ai-on-eks.git
Authentication Profile

If you are using a named profile for authentication, export it before running the install script: export AWS_PROFILE="<PROFILE_name>"

2. Review and customize configurations:​

  • Check available addons in infra/base/terraform/variables.tf
  • Modify addon settings in infra/jark-stack/terraform/blueprint.tfvars as needed
  • Update the AWS region in blueprint.tfvars

Enable DRA Components:

In the blueprint.tfvars file, uncomment the following lines:

blueprint.tfvars
enable_nvidia_dra_driver   = true
enable_nvidia_gpu_operator = true
Automated Setup

The NVIDIA GPU Operator includes all necessary components:

  • NVIDIA Device Plugin
  • DCGM Exporter
  • MIG Manager
  • GPU Feature Discovery
  • Node Feature Discovery

The NVIDIA DRA Driver is deployed as a separate Helm chart parallel to the GPU Operator.

3. Navigate to the deployment directory and run the install script:​

Deploy JARK Stack with DRA
cd ai-on-eks/infra/jark-stack && chmod +x install.sh
./install.sh

This script will automatically provision and configure the following components:

  • Amazon EKS Cluster with DRA (Dynamic Resource Allocation) feature gates enabled.
  • Two GPU managed node groups using Amazon Linux 2023 GPU AMIs:
    • G6 Node Group: Intended for testing MPS and time-slicing strategies.
    • P4d(e) Node Group: Intended for testing MIG-based GPU partitioning.

⚠️ Both node groups are initialized with zero nodes to avoid unnecessary cost.

  • To test MPS/time-slicing, manually update the g6 node group’s min_size and desired_size via the EKS console.
  • To test MIG, you need at least one p4d or p4de instance, which requires a Capacity Block Reservation (CBR). Edit infra/base/terraform/eks.tf, set your actual capacity_reservation_id, and change the min_size for the MIG node group to 1.

4. Verify Deployment​

Follow these verification steps to ensure your DRA deployment is working correctly:

Step 1: Configure kubectl access

Update your local kubeconfig to access the Kubernetes cluster:

aws eks update-kubeconfig --name jark-stack  # Replace with your EKS cluster name

Step 2: Verify worker nodes

First, let's verify that worker nodes are running in the cluster:

kubectl get nodes

Expected output: You should see two x86 instances from the core node group, plus any GPU instances (g6, p4d, etc.) that you manually scaled up via the EKS console.

Step 3: Verify DRA components

Run this command to verify all deployments, including the NVIDIA GPU Operator and NVIDIA DRA Driver:

kubectl get deployments -A

Expected output: All pods should be in Running state before proceeding to test the examples below.

Instance compatibility for testing:

  • Time-slicing and MPS: Any G5 or G6 instance
  • MIG partitioning: P-series instances (P4d or higher)
  • IMEX use cases: P6e-GB200 UltraServers

Once all components are running, you can start testing the various DRA examples mentioned in the following sections.

Component Architecture​

NVIDIA Tools

The NVIDIA DRA Driver runs as an independent Helm chart parallel to the NVIDIA GPU Operator, not as part of it. Both components work together to provide comprehensive GPU management capabilities.

🎲

GPU Sharing Strategies: Technical Deep Dive​

Understanding GPU sharing technologies is crucial for optimizing resource utilization. Each strategy provides different benefits and addresses specific use cases.

Basic GPU Allocation​

Standard GPU allocation without sharing - each workload gets exclusive access to a complete GPU. This is the traditional model that provides maximum performance isolation.

How to Deploy Basic Allocation:

basic-gpu-claim-template.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
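
The pod manifest (basic-gpu-pod.yaml) referenced in the deploy step below is not reproduced here. A minimal sketch of what it needs to contain, assuming the single-gpu template above; the container image and command are illustrative placeholders:

basic-gpu-pod.yaml (illustrative sketch)
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1          # must match the ResourceClaimTemplate's namespace
  name: basic-gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
    command: ["sh", "-c", "nvidia-smi && sleep 300"]
    resources:
      claims:
      - name: gpu               # binds this container to the claim below
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu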

Deploy the Example:

Deploy Basic GPU Allocation
kubectl apply -f basic-gpu-claim-template.yaml
kubectl apply -f basic-gpu-pod.yaml
kubectl get pods -n gpu-test1 -w

Best For:

  • Large model training requiring full GPU resources
  • Workloads that fully utilize GPU compute and memory
  • Applications requiring maximum performance isolation
  • Legacy applications not designed for GPU sharing

Strategy Selection Guide​

| Workload Type | Recommended Strategy | Key Benefit |
|---|---|---|
| Small inference jobs | Time-slicing or MPS | Higher GPU utilization |
| Concurrent small models | MPS | True parallelism |
| Production multi-tenant | MIG | Hardware isolation |
| Large model training | Basic allocation | Maximum performance |
| Development/testing | Time-slicing | Flexibility and simplicity |
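
For comparison with the table above, an MPS variant differs from the basic template only in its driver-specific sharing configuration. The sketch below reuses the mps-gpu namespace and template name referenced in the Cleanup section; the GpuConfig parameters are assumptions based on NVIDIA's DRA driver examples and should be verified against your driver version:

mps-gpu-claim-template.yaml (illustrative)
apiVersion: v1
kind: Namespace
metadata:
  name: mps-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: mps-gpu
  name: mps-gpu-template
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1   # assumed; check NVIDIA DRA driver docs
            kind: GpuConfig
            sharing:
              strategy: MPS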

Cleanup​

Removing DRA Components​

Remove all DRA example workloads:

Clean up DRA workloads
# Delete all pods first to ensure proper cleanup
kubectl delete pod inference-pod-1 -n timeslicing-gpu --ignore-not-found
kubectl delete pod training-pod-2 -n timeslicing-gpu --ignore-not-found
kubectl delete pod mps-workload -n mps-gpu --ignore-not-found
kubectl delete pod mig-workload -n mig-gpu --ignore-not-found
kubectl delete pod basic-gpu-pod -n gpu-test1 --ignore-not-found

# Delete ResourceClaimTemplates
kubectl delete resourceclaimtemplate timeslicing-gpu-template -n timeslicing-gpu --ignore-not-found
kubectl delete resourceclaimtemplate mps-gpu-template -n mps-gpu --ignore-not-found
kubectl delete resourceclaimtemplate mig-gpu-template -n mig-gpu --ignore-not-found
kubectl delete resourceclaimtemplate basic-gpu-template -n gpu-test1 --ignore-not-found

# Delete any remaining ResourceClaims
kubectl delete resourceclaims --all --all-namespaces --ignore-not-found

# Delete ConfigMaps (contain scripts)
kubectl delete configmap timeslicing-scripts-configmap -n timeslicing-gpu --ignore-not-found

# Finally delete namespaces
kubectl delete namespace timeslicing-gpu --ignore-not-found
kubectl delete namespace mps-gpu --ignore-not-found
kubectl delete namespace mig-gpu --ignore-not-found
kubectl delete namespace gpu-test1 --ignore-not-found

# Verify cleanup
kubectl get resourceclaims --all-namespaces
kubectl get resourceclaimtemplates --all-namespaces
πŸ”§ Troubleshooting Common Issues

Issue: Pods with ResourceClaims stuck in Pending state

Diagnosis:

# Check ResourceClaim status
kubectl get resourceclaims --all-namespaces -o wide

# Check DRA driver logs
kubectl logs -n gpu-operator -l app=nvidia-dra-driver --tail=100

# Verify DeviceClasses exist
kubectl get deviceclasses

Resolution:

# Restart DRA driver pods
kubectl delete pods -n gpu-operator -l app=nvidia-dra-driver

# Check node GPU availability
kubectl describe nodes | grep -A 10 "Allocatable"

Conclusion​

Dynamic Resource Allocation represents a fundamental shift from rigid GPU allocation to intelligent, workload-aware resource management. By leveraging structured ResourceClaims and vendor-specific drivers, DRA unlocks the GPU utilization rates necessary for cost-effective AI/ML operations at enterprise scale.

πŸš€ Ready to Transform Your GPU Infrastructure?

With the simplified JARK-based deployment approach, organizations can implement production-grade DRA capabilities in three steps, transforming their GPU infrastructure from a static resource pool into a dynamic, intelligent platform optimized for modern AI workloads.

The combination of EKS's managed infrastructure, NVIDIA's driver ecosystem, and Kubernetes' declarative model creates a powerful foundation for next-generation AI workloads - from small inference jobs to multi-node distributed training on GB200 superchips.