Dynamic Resource Allocation for GPUs on Amazon EKS
TL;DR: Dynamic GPU Scheduling with DRA on EKS
Dynamic Resource Allocation (DRA) is the next-generation GPU scheduling approach in Kubernetes, providing advanced GPU management capabilities beyond traditional device plugins. Here's what matters:
DRA Advantages over Traditional GPU Scheduling
- Fine-grained resource control: Request specific GPU memory amounts, not just whole devices
- Per-workload sharing strategies: Choose mps, time-slicing, mig, or exclusive per pod, not cluster-wide
- Topology-aware scheduling: Understands NVLink, IMEX, and GPU interconnects for multi-GPU workloads
- Advanced GPU features: Required for Amazon EC2 P6e-GB200 UltraServers IMEX, Multi-Node NVLink, and next-gen GPU capabilities
- Coexistence-friendly: Can run alongside traditional device plugins during transition
Key Implementation Details:
- EKS v1.33: DRA feature gates enabled in EKS-optimized configurations
- For detailed DRA implementation, see the Kubernetes DRA documentation
- Node provisioning compatibility:
  - Managed Node Groups: Full DRA support
  - Self-Managed Node Groups: DRA support (requires manual configuration)
  - Karpenter: DRA support in development (Issue #1231)
- Coexistence: Traditional device plugin and DRA can run simultaneously
Why Managed/Self-Managed Node Groups vs Karpenter for DRA?
- Managed/Self-Managed Node Groups: Full DRA support, optimized for Capacity Block Reservations
- Karpenter: DRA support in development; dynamic scaling conflicts with reserved GPU capacity
- EKS-optimized AMIs: Come with pre-installed NVIDIA drivers
Can I Use Both Traditional GPU Allocation and DRA Together?
- Coexistence supported: Both can run simultaneously on the same cluster
- DRA is the future: NVIDIA and Kubernetes are moving exclusively to DRA
- Migration strategy: Use DRA for new workloads, traditional allocation for existing production
Production Readiness
- Technology Preview: GPU allocation and sharing features actively developed by NVIDIA
- Production Ready: ComputeDomains for Multi-Node NVLink fully supported
- Scheduling overhead: Additional latency due to the claim resolution process
- General Availability: Expected in Kubernetes v1.34 (2025)
- Latest status updates: Follow the NVIDIA DRA Driver GitHub for current development progress
For comprehensive guidance on AI/ML workloads on EKS, see the AWS EKS Best Practices for AI/ML Compute.
Enterprise GPU Utilization Crisis
Despite high demand, enterprise AI platforms consistently waste over half their GPU resources due to scheduling limitations. This represents millions in infrastructure costs.
Even in high-demand AI clusters, GPU utilization frequently remains below 40%. This isn't a configuration issue; it's a fundamental limitation of how Kubernetes abstracts GPU resources. Organizations are paying premium prices for GPU instances while letting the majority of compute power sit idle.
The GPU Scheduling Challenge in Kubernetes
Current State: Traditional GPU Allocation
Kubernetes has rapidly evolved into the de facto standard for orchestrating AI/ML workloads across enterprise environments, with Amazon EKS emerging as the leading platform for managing GPU-accelerated infrastructure at scale. Organizations are running everything from small inference services to massive distributed training jobs on EKS clusters, leveraging GPU instances like P4d, P5, and the latest P6 series to power their machine learning pipelines.
However, despite Kubernetes' sophistication in managing containerized workloads, the traditional GPU scheduling model remains surprisingly primitive and creates significant operational challenges. The current approach treats GPUs as simple, atomic resources that can only be allocated in whole units, fundamentally mismatched with the diverse and evolving needs of modern AI workloads.
How Traditional GPU Scheduling Works:
- Pods request GPUs using simple integer values: nvidia.com/gpu: 1
- Scheduler treats GPUs as opaque, indivisible resources
- Each workload gets exclusive access to entire GPU devices
- No awareness of actual resource requirements or GPU topology
The Problem with This Approach: Modern AI workloads have diverse requirements that don't fit this binary model:
- Small inference jobs need only 2-4GB GPU memory but get allocated entire 80GB A100s
- Large training jobs require coordinated multi-GPU communication via NVLink or IMEX
- Mixed workloads could share GPUs efficiently but are forced into separate devices
The GPU Utilization Crisis
Even in high-demand clusters, GPU utilization frequently remains below 40%. This isn't a configuration issue: it's a fundamental limitation of how Kubernetes abstracts GPU resources.
Common symptoms of inefficient GPU allocation:
- Queue starvation - Small inference jobs wait behind long-running training tasks
- Resource fragmentation - GPU memory is stranded in unusable chunks across nodes
- Topology blindness - Multi-GPU jobs get suboptimal placement, degrading NVLink performance
- Cost explosion - Organizations overprovision GPUs to work around scheduling inefficiencies
Enter Dynamic Resource Allocation (DRA)
What DRA Changes
Dynamic Resource Allocation fundamentally transforms GPU scheduling in Kubernetes from a rigid, device-centric model to a flexible, workload-aware approach:
Traditional Approach:
resources:
  limits:
    nvidia.com/gpu: 1 # Get entire GPU, no customization
DRA Approach:
resourceClaims:
- name: gpu-claim
  resourceClaimTemplateName: gpu-template # Detailed, workload-aware requirements
See examples section below for ResourceClaimTemplate configurations.
Critical: ResourceClaims must exist in the same namespace as the Pods that reference them. Cross-namespace resource claims are not supported.
Key DRA Innovations
Fine-grained Resource Control
- Request specific GPU memory amounts (e.g., 16Gi out of 80Gi available)
- Specify compute requirements independent of memory needs
- Define topology constraints for multi-GPU workloads
Note: ResourceClaims and Pods must be in the same namespace
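As a rough illustration of fine-grained requests, the sketch below asks for any NVIDIA GPU that advertises at least 40Gi of memory. The template name, namespace, and the exact capacity key are assumptions for illustration; verify the attribute and capacity names against the ResourceSlices your NVIDIA DRA driver version actually publishes.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-40gi-template          # hypothetical name
  namespace: my-workloads          # must match the namespace of the Pods that reference it
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
        - cel:
            # Assumed capacity key; check `kubectl get resourceslices -o yaml` for the real names
            expression: |
              device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0
```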
Per-Workload Sharing Strategies
MPS - Concurrent small workloads with memory isolation
Time-slicing - Workloads with different peak usage patterns
MIG - Hardware-level isolation in multi-tenant environments
Exclusive - Performance-critical training jobs
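The strategy itself is selected per claim through the NVIDIA driver's opaque GpuConfig parameters. The time-slicing and MPS templates later in this guide differ only in the strategy value, roughly like this fragment (names reused from those examples):

```yaml
config:
- requests: ["shared-gpu"]
  opaque:
    driver: gpu.nvidia.com
    parameters:
      apiVersion: resource.nvidia.com/v1beta1
      kind: GpuConfig
      sharing:
        strategy: TimeSlicing   # or MPS; omit the config block for exclusive access
```

MIG, by contrast, is requested through the mig.nvidia.com DeviceClass with a profile selector rather than a sharing strategy, as shown in the MIG examples below.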
Topology-Aware Scheduling
- Understands NVLink connections between GPUs
- Leverages IMEX for Amazon EC2 P6e-GB200 UltraServer clusters
- Optimizes placement for distributed training workloads
Future-Proof Architecture
- Required for next-generation systems like Amazon EC2 P6e-GB200 UltraServers
- Enables advanced features like Multi-Node NVLink
- Supports emerging GPU architectures and sharing technologies
Understanding IMEX, ComputeDomains, and Amazon EC2 P6e-GB200 Multi-Node Scheduling
IMEX (NVIDIA Internode Memory Exchange/Management Service) is NVIDIA's orchestration service for GPU memory sharing across NVLink multi-node deployments. In Amazon EC2 P6e-GB200 UltraServer configurations, IMEX coordinates memory export and import operations between nodes, enabling direct GPU-to-GPU memory access across multiple compute nodes for massive AI model training with billions of parameters.
ComputeDomains represent logical groupings of interconnected GPUs that can communicate efficiently through high-bandwidth connections like NVLink or IMEX. DRA uses ComputeDomains to understand GPU topology and ensure workloads requiring multi-GPU coordination are scheduled on appropriately connected hardware.
Amazon EC2 P6e-GB200 Multi-Node Scheduling leverages DRA's topology awareness to coordinate workloads across multiple superchip nodes. Traditional GPU scheduling cannot understand these complex interconnect relationships, making DRA essential for optimal placement of distributed training jobs on Amazon EC2 P6e-GB200 UltraServer systems where proper GPU topology selection directly impacts training performance.
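As a hedged sketch of how a workload opts into a ComputeDomain with DRA, the template below requests an IMEX channel through the compute-domain.nvidia.com DeviceClass listed later in this guide. The metadata names are hypothetical, and the NVIDIA DRA driver additionally manages a ComputeDomain custom resource whose schema varies by release, so treat this as illustrative and follow NVIDIA's documentation for the full setup.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: imex-channel-template       # hypothetical name
  namespace: distributed-training   # hypothetical namespace
spec:
  spec:
    devices:
      requests:
      - name: imex-channel
        deviceClassName: compute-domain.nvidia.com   # cross-node GPU coordination DeviceClass
```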
For detailed configuration examples and implementation guidance, see the AWS EKS AI/ML Best Practices documentation.
Implementation Considerations for EKS
Now that we understand DRA's capabilities and advanced features like IMEX and ComputeDomains, let's explore the practical considerations for implementing DRA on Amazon EKS. The following sections address key decisions around node provisioning, migration strategies, and EKS-specific configurations that will determine your DRA deployment success.
Managed Node Groups vs Karpenter for P-Series GPU Instances and DRA
The choice between node provisioning methods for DRA isn't just about technical compatibility. It's fundamentally about how GPU capacity is purchased and utilized in enterprise AI workloads. Managed and Self-Managed Node Groups are currently the recommended approach for DRA because they align with the economics and operational patterns of high-end GPU instances.
Here's why: The majority of large GPU instances (P4d (A100), P5 (H100), P6 with B200, and P6e with GB200) are primarily available through AWS Capacity Block Reservations rather than on-demand pricing. When organizations purchase Capacity Blocks, they commit to paying for every second of GPU time until the reservation expires, regardless of whether the GPUs are actively utilized. This creates a fundamental mismatch with Karpenter's core value proposition of dynamic scaling based on workload demand. Spinning nodes down during low-demand periods doesn't save money. It actually wastes the reserved capacity you're already paying for.
Additionally, Karpenter doesn't yet support DRA scheduling (Issue #1231 tracks active development), making it incompatible with production DRA workloads. While Karpenter excels at cost optimization through dynamic scaling for general compute workloads, Capacity Block reservations require an "always-on" utilization strategy to maximize ROI: exactly what Managed Node Groups provide with their static capacity model.
The future picture is more optimistic: Karpenter's roadmap includes static node features that would make it suitable for Capacity Block scenarios. The community is actively working on manual node provisioning without workloads and static provisioning capabilities through RFCs like static provisioning and manual node provisioning. Once DRA support is added alongside these static provisioning capabilities, Karpenter could become the preferred choice for DRA workloads with Capacity Block ML reserved instances. Until then, Managed Node Groups with EKS-optimized AMIs (which come with pre-installed NVIDIA drivers) provide the most reliable foundation for DRA implementations.
DRA and Traditional GPU Allocation Coexistence
Yes, but with careful configuration to avoid conflicts. DRA and traditional GPU allocation can coexist on the same cluster, but this requires thoughtful setup to prevent resource double-allocation issues. NVIDIA's DRA driver is designed as an additional component alongside the GPU Operator, with selective enablement to avoid conflicts.
The recommended approach for gradual migration: Configure the NVIDIA DRA driver to enable only specific subsystems initially. For example, you can set resources.gpus.enabled=false to use traditional device plugins for GPU allocation while enabling DRA's ComputeDomain subsystem for Multi-Node NVLink capabilities. This allows teams to gain operational experience with DRA's advanced features without risking established GPU allocation workflows.
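A minimal sketch of that selective enablement as Helm values for the NVIDIA DRA driver chart is shown below. The resources.gpus.enabled key is the one referenced above; the computeDomains key name is an assumption and should be checked against the chart version you deploy.

```yaml
# values-dra-driver.yaml (illustrative override for a gradual rollout)
resources:
  gpus:
    enabled: false       # leave GPU allocation to the traditional device plugin for now
  computeDomains:
    enabled: true        # assumed key: adopt DRA only for Multi-Node NVLink ComputeDomains
```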
Key considerations for coexistence:
- Avoid same-device conflicts: DRA and device plugins should not manage the same GPU devices simultaneously
- Selective component enablement: Use NVIDIA DRA driver's modular design to enable features gradually
- Node selector management: Configure node selectors carefully to prevent resource allocation conflicts
- Technology Preview status: GPU allocation and sharing features are in Technology Preview (check NVIDIA DRA Driver GitHub for updates)
For migration planning, start with DRA's production-ready features like ComputeDomains for Multi-Node NVLink, while keeping traditional device plugins for core GPU allocation. Once DRA's GPU allocation reaches full support, gradually migrate workloads starting with development and inference services before moving mission-critical training jobs. NVIDIA and the Kubernetes community have designed DRA as the eventual replacement for device plugins, but the transition requires careful orchestration to maintain cluster stability.
Visual Comparison: Traditional vs DRA
The comparison below illustrates how DRA fundamentally changes the scheduling flow:
- Traditional Model: The pod directly requests an entire GPU via the node resource model. Scheduling and allocation are static, with no room for partial usage or workload intent.
- DRA Model: Pods express intent via templates; claims are dynamically generated and resolved with the help of a DRA-aware scheduler and device driver. Multiple workloads can share GPUs safely and efficiently, maximizing utilization.
Technical Capabilities Comparison
| Capability | Traditional device plugin | DRA |
| --- | --- | --- |
| Resource request | nvidia.com/gpu: 1 | ResourceClaimTemplate with device selectors |
| Granularity | Whole GPUs only | Fine-grained (memory, compute, topology constraints) |
| Sharing strategy | Cluster-wide configuration | Per-workload (MPS, time-slicing, MIG, exclusive) |
| Topology awareness | None | NVLink/IMEX-aware placement |
How DRA Actually Works: The Complete Technical Flow
Dynamic Resource Allocation (DRA) extends Kubernetes scheduling with a modular, pluggable mechanism for handling GPU and other device resources. Rather than allocating integer units of opaque hardware, DRA introduces ResourceClaims, ResourceClaimTemplates, DeviceClasses, and ResourceSlices to express, match, and provision device requirements at runtime.
Step-by-step DRA Workflow
DRA fundamentally changes how Kubernetes manages GPU resources through sophisticated orchestration:
1. Resource Discovery and Advertisement
When NVIDIA DRA driver starts, it discovers available GPUs on each node and creates ResourceSlices that advertise device capabilities to the Kubernetes API server.
2. DeviceClass Registration
The driver registers one or more DeviceClass objects to logically group GPU resources:
- gpu.nvidia.com: Standard GPU resources
- mig.nvidia.com: Multi-Instance GPU partitions
- compute-domain.nvidia.com: Cross-node GPU coordination
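For reference, a DeviceClass is a small cluster-scoped object that selects devices by driver. The driver installs the real definitions for you; a simplified sketch looks roughly like this:

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.nvidia.com
spec:
  selectors:
  - cel:
      # Match every device published by the NVIDIA GPU kubelet plugin
      expression: |
        device.driver == 'gpu.nvidia.com'
```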
3. Resource Claim Creation
ResourceClaimTemplates generate individual ResourceClaims for each pod, specifying:
- Specific GPU memory requirements
- Sharing strategy (MPS, time-slicing, exclusive)
- Driver versions and compute capabilities
- Topology constraints for multi-GPU workloads
4. Intelligent Scheduling
The DRA-aware scheduler evaluates pending ResourceClaims and queries available ResourceSlices across nodes:
- Matches device properties and constraints using CEL expressions
- Ensures sharing strategy compatibility with other running pods
- Selects optimal nodes considering topology, availability, and policy
5. Dynamic Allocation
On the selected node, the DRA driver:
- Sets up device access for the container (e.g., mounts MIG instance or configures MPS)
- Allocates shared vs. exclusive access as per claim configuration
- Isolates GPU slices securely between concurrent workloads
Deploying the Solution
In this example, you will provision the JARK cluster on Amazon EKS with DRA support. The walkthrough has three stages:
- Prerequisites: Install required tools and dependencies
- Deploy: Configure and run the JARK stack installation
- Verify: Test your DRA deployment and validate functionality
Prerequisites
Ensure that you have installed the following tools on your machine:
- AWS CLI - AWS Command Line Interface
- kubectl - Kubernetes command-line tool
- terraform - Infrastructure as Code tool
Deploy
1. Clone the repository:
git clone https://github.com/awslabs/ai-on-eks.git
If you are using a named profile for authentication, set the AWS_PROFILE environment variable to the desired profile name: export AWS_PROFILE="<PROFILE_name>"
2. Review and customize configurations:
- Check available addons in infra/base/terraform/variables.tf
- Modify addon settings in infra/jark-stack/terraform/blueprint.tfvars as needed
- Update the AWS region in blueprint.tfvars
Enable DRA Components:
In the blueprint.tfvars file, uncomment the following lines:
enable_nvidia_dra_driver = true
enable_nvidia_gpu_operator = true
The NVIDIA GPU Operator includes all necessary components:
- NVIDIA Device Plugin
- DCGM Exporter
- MIG Manager
- GPU Feature Discovery
- Node Feature Discovery
The NVIDIA DRA Driver is deployed as a separate Helm chart parallel to the GPU Operator.
3. Navigate to the deployment directory and run the install script:
cd ai-on-eks/infra/jark-stack && chmod +x install.sh
./install.sh
This script will automatically provision and configure the following components:
- Amazon EKS Cluster with DRA (Dynamic Resource Allocation) feature gates enabled.
- Two GPU-managed node groups using Amazon Linux 2023 GPU AMIs:
- G6 Node Group: Intended for testing MPS and time-slicing strategies.
- P4d(e) Node Group: Intended for testing MIG-based GPU partitioning.
Note: Both node groups are initialized with zero nodes to avoid unnecessary cost.
- To test MPS/time-slicing, manually update the g6 node group's min_size and desired_size via the EKS console.
- To test MIG, you need at least one p4d or p4de instance, which requires a Capacity Block Reservation (CBR). Edit infra/base/terraform/eks.tf, set your actual capacity_reservation_id, and change the min_size for the MIG node group to 1.
4. Verify Deployment
Follow these verification steps to ensure your DRA deployment is working correctly:
Step 1: Configure kubectl access
Update your local kubeconfig to access the Kubernetes cluster:
aws eks update-kubeconfig --name jark-stack # Replace with your EKS cluster name
Step 2: Verify worker nodes
First, let's verify that worker nodes are running in the cluster:
kubectl get nodes
Expected output: You should see two x86 instances from the core node group, plus any GPU instances (g6, p4d, etc.) that you manually scaled up via the EKS console.
Step 3: Verify DRA components
Run this command to verify all deployments, including the NVIDIA GPU Operator and NVIDIA DRA Driver:
kubectl get deployments -A
Expected output: All pods should be in Running state before proceeding to test the examples below.
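Beyond the deployments, you can confirm that the DRA API objects themselves exist; these are the same objects the troubleshooting section later queries:

```bash
# DeviceClasses registered by the NVIDIA DRA driver (for example gpu.nvidia.com, mig.nvidia.com)
kubectl get deviceclasses

# ResourceSlices appear only for GPU nodes that have actually been scaled up
kubectl get resourceslices
```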
Instance compatibility for testing:
- Time-slicing and MPS: Any G5 or G6 instance
- MIG partitioning: P-series instances (P4d or higher)
- IMEX use cases: P6e-GB200 UltraServers
Once all components are running, you can start testing the various DRA examples mentioned in the following sections.
Component Architecture
The NVIDIA DRA Driver runs as an independent Helm chart parallel to the NVIDIA GPU Operator, not as part of it. Both components work together to provide comprehensive GPU management capabilities.
GPU Sharing Strategies: Technical Deep Dive
Understanding GPU sharing technologies is crucial for optimizing resource utilization. Each strategy provides different benefits and addresses specific use cases.
- Basic Allocation
- Time-Slicing
- Multi-Process Service (MPS)
- Multi-Instance GPU (MIG)
Basic GPU Allocation
Standard GPU allocation without sharing - each workload gets exclusive access to a complete GPU. This is the traditional model that provides maximum performance isolation.
How to Deploy Basic Allocation:
- ResourceClaimTemplate
- Basic Pod
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: gpu-pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu0
  resourceClaims:
  - name: gpu0
    resourceClaimTemplateName: single-gpu
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
Deploy the Example:
kubectl apply -f basic-gpu-claim-template.yaml
kubectl apply -f basic-gpu-pod.yaml
kubectl get pods -n gpu-test1 -w
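Once the pod is Running, a quick way to confirm the claim was satisfied is to read the container log, since the example's entrypoint runs nvidia-smi -L:

```bash
# Should list exactly one GPU (an NVIDIA L4 on a G6 node)
kubectl logs -n gpu-test1 gpu-pod -c ctr0
```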
Best For:
- Large model training requiring full GPU resources
- Workloads that fully utilize GPU compute and memory
- Applications requiring maximum performance isolation
- Legacy applications not designed for GPU sharing
What is Time-Slicing?
Time-slicing is a GPU sharing mechanism where multiple workloads take turns using the GPU, with each getting exclusive access during their allocated time slice. This approach is similar to CPU time-sharing but applied to GPU resources.
Technical Implementation:
- The GPU scheduler allocates specific time windows (typically 1-10ms) to each workload
- During a workload's time slice, it has complete access to GPU compute and memory
- Context switching occurs between time slices, saving and restoring GPU state
- No memory isolation between workloads - they share the same GPU memory space
Key Characteristics:
- Temporal Isolation: Workloads are isolated in time but not in memory
- Full GPU Access: Each workload gets complete GPU resources during its slice
- Context Switching Overhead: Small performance penalty for switching between workloads
- Flexible Allocation: Time slice duration can be adjusted based on workload requirements
How to Deploy Time-Slicing with DRA
- ResourceClaimTemplate
- Pod Configuration
apiVersion: v1
kind: Namespace
metadata:
  name: timeslicing-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: timeslicing-gpu-template
  namespace: timeslicing-gpu
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing
# ConfigMap containing Python scripts for timeslicing pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: timeslicing-scripts-configmap
  namespace: timeslicing-gpu
data:
  inference-script.py: |
    import torch
    import time
    import os

    print(f"=== POD 1 STARTING ===")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Simulate inference workload
        for i in range(20):
            x = torch.randn(1000, 1000).cuda()
            y = torch.mm(x, x.t())
            print(f"Pod 1 - Iteration {i+1} completed at {time.strftime('%H:%M:%S')}")
            time.sleep(5)
    else:
        print("No GPU available!")
        time.sleep(60)
  training-script.py: |
    import torch
    import time
    import os

    print(f"=== POD 2 STARTING ===")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Simulate training workload with heavier compute
        for i in range(15):
            x = torch.randn(2000, 2000).cuda()
            y = torch.mm(x, x.t())
            loss = torch.sum(y)
            print(f"Pod 2 - Training step {i+1}, Loss: {loss.item():.2f} at {time.strftime('%H:%M:%S')}")
            time.sleep(5)
    else:
        print("No GPU available!")
        time.sleep(60)
---
# Pod 1 - Inference workload
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod-1
  namespace: timeslicing-gpu
  labels:
    app: gpu-inference
spec:
  restartPolicy: Never
  containers:
  - name: inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim
  resourceClaims:
  - name: shared-gpu-claim
    resourceClaimTemplateName: timeslicing-gpu-template
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: timeslicing-scripts-configmap
      defaultMode: 0755
---
# Pod 2 - Training workload
apiVersion: v1
kind: Pod
metadata:
  name: training-pod-2
  namespace: timeslicing-gpu
  labels:
    app: gpu-training
spec:
  restartPolicy: Never
  containers:
  - name: training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim-2
  resourceClaims:
  - name: shared-gpu-claim-2
    resourceClaimTemplateName: timeslicing-gpu-template
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: timeslicing-scripts-configmap
      defaultMode: 0755
Deploy the Example:
kubectl apply -f timeslicing-claim-template.yaml
kubectl apply -f timeslicing-pod.yaml
kubectl get pods -n timeslicing-gpu -w
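To watch time-slicing in action, tail both pods' logs; with this configuration both workloads should report the same underlying GPU while their iterations interleave:

```bash
kubectl logs -n timeslicing-gpu inference-pod-1 -f   # Pod 1 iterations
kubectl logs -n timeslicing-gpu training-pod-2 -f    # Pod 2 training steps on the same GPU
```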
Best For:
- Inference workloads with sporadic GPU usage
- Development and testing environments
- Workloads with different peak usage times
- Applications that don't require memory isolation
No memory or fault isolation between workloads. One workload can affect others through memory exhaustion or GPU errors.
What is MPS?
NVIDIA Multi-Process Service (MPS) is a GPU sharing technology that allows multiple CUDA applications to run concurrently on the same GPU by creating a daemon that manages GPU access and enables spatial sharing of GPU resources.
Technical Implementation:
- MPS daemon acts as a proxy between CUDA applications and the GPU driver
- Each process gets dedicated GPU memory allocation
- Compute kernels from different processes can execute simultaneously when resources allow
- Memory isolation is maintained between processes
- Hardware scheduling enables true parallel execution
Key Characteristics:
- Spatial Isolation: GPU compute units can be shared simultaneously
- Memory Isolation: Each process has dedicated memory space
- Concurrent Execution: Multiple kernels can run in parallel
- Lower Latency: Reduced context switching compared to time-slicing
How to Deploy MPS with DRA
- ResourceClaimTemplate
- Multi-Container Pod
apiVersion: v1
kind: Namespace
metadata:
  name: mps-gpu
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mps-gpu-template
  namespace: mps-gpu
spec:
  spec:
    devices:
      requests:
      - name: shared-gpu
        deviceClassName: gpu.nvidia.com
      config:
      - requests: ["shared-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
# ConfigMap containing Python scripts for MPS pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-scripts-configmap
  namespace: mps-gpu
data:
  inference-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== INFERENCE CONTAINER STARTING ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Create inference model
        model = nn.Sequential(
            nn.Linear(1000, 500),
            nn.ReLU(),
            nn.Linear(500, 100)
        ).cuda()

        # Run inference
        for i in range(1, 999999):
            with torch.no_grad():
                x = torch.randn(128, 1000).cuda()
                output = model(x)
                result = torch.sum(output)
            print(f"Inference Container PID {os.getpid()}: Batch {i}, Result: {result.item():.2f} at {time.strftime('%H:%M:%S')}")
            time.sleep(2)
    else:
        print("No GPU available!")
        time.sleep(60)
  training-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== TRAINING CONTAINER STARTING ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Current GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1024**3:.1f} GB")

        # Create training model
        model = nn.Sequential(
            nn.Linear(2000, 1000),
            nn.ReLU(),
            nn.Linear(1000, 500),
            nn.ReLU(),
            nn.Linear(500, 10)
        ).cuda()

        criterion = nn.MSELoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

        # Run training
        for epoch in range(1, 999999):
            x = torch.randn(64, 2000).cuda()
            target = torch.randn(64, 10).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            print(f"Training Container PID {os.getpid()}: Epoch {epoch}, Loss: {loss.item():.4f} at {time.strftime('%H:%M:%S')}")
            time.sleep(3)
    else:
        print("No GPU available!")
        time.sleep(60)
---
# Single Pod with Multiple Containers sharing GPU via MPS
apiVersion: v1
kind: Pod
metadata:
  name: mps-multi-container-pod
  namespace: mps-gpu
  labels:
    app: mps-demo
spec:
  restartPolicy: Never
  containers:
  # Container 1 - Inference workload
  - name: inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim
        request: shared-gpu
  # Container 2 - Training workload
  - name: training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: shared-gpu-claim
        request: shared-gpu
  resourceClaims:
  - name: shared-gpu-claim
    resourceClaimTemplateName: mps-gpu-template
  nodeSelector:
    NodeGroupType: g6-mng
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mps-scripts-configmap
      defaultMode: 0755
Deploy the Example:
kubectl apply -f mps-claim-template.yaml
kubectl apply -f mps-pod.yaml
kubectl get pods -n mps-gpu -w
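To verify MPS sharing, check each container's log and then the process list on the shared GPU; both Python processes should be visible, typically alongside the MPS control daemon (exact nvidia-smi output depends on driver and MPS versions):

```bash
kubectl logs -n mps-gpu mps-multi-container-pod -c inference-container --tail=5
kubectl logs -n mps-gpu mps-multi-container-pod -c training-container --tail=5

# Process view on the shared GPU
kubectl exec -n mps-gpu mps-multi-container-pod -c inference-container -- nvidia-smi
```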
Best For:
- Multiple small inference workloads
- Concurrent model serving scenarios
- Workloads using less than 50% of GPU compute
- Applications requiring memory isolation
MPS eliminates context switching overhead and enables true parallelism. Ideal for workloads using less than 50% of GPU compute capacity.
What is MIG?
Multi-Instance GPU (MIG) is a hardware-level GPU partitioning technology available on NVIDIA A100, H100, and newer GPUs that creates smaller, isolated GPU instances with dedicated compute units, memory, and memory bandwidth.
Technical Implementation:
- Hardware-level partitioning creates separate GPU instances
- Each MIG instance has dedicated streaming multiprocessors (SMs)
- Memory and memory bandwidth are physically partitioned
- Complete fault isolation between instances
- Independent scheduling and execution contexts
Key Characteristics:
- Hardware Isolation: Physical separation of compute and memory resources
- Fault Isolation: Issues in one instance don't affect others
- Predictable Performance: Guaranteed resources for each instance
- Fixed Partitioning: Predefined MIG profiles (1g.5gb, 2g.10gb, etc.)
How to Deploy MIG with DRA
- ResourceClaimTemplate
- MIG Pod
apiVersion: v1
kind: Namespace
metadata:
  name: mig-gpu
---
# Template for 3g.40gb MIG instance (Large training)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-large-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-large
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '3g.40gb'
---
# Template for 2g.20gb MIG instance (Medium training)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-medium-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-medium
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '2g.20gb'
---
# Template for 1g.10gb MIG instance (Small inference)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mig-small-template
  namespace: mig-gpu
spec:
  spec:
    devices:
      requests:
      - name: mig-small
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: |
              device.attributes['gpu.nvidia.com'].profile == '1g.10gb'
# ConfigMap containing Python scripts for MIG pods
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-scripts-configmap
  namespace: mig-gpu
data:
  large-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== LARGE TRAINING POD (3g.40gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Large model for 3g.40gb instance
        model = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.ReLU(),
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        optimizer = optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Training loop
        for epoch in range(100):
            # Large batch for 3g.40gb
            x = torch.randn(256, 2048).cuda()
            y = torch.randint(0, 10, (256,)).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f"Large Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(3)

        print("Large training completed on 3g.40gb MIG instance")
  medium-training-script.py: |
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import time
    import os

    print(f"=== MEDIUM TRAINING POD (2g.20gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Medium model for 2g.20gb instance
        model = nn.Sequential(
            nn.Linear(1024, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        optimizer = optim.Adam(model.parameters())
        criterion = nn.CrossEntropyLoss()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Training loop
        for epoch in range(100):
            # Medium batch for 2g.20gb
            x = torch.randn(128, 1024).cuda()
            y = torch.randint(0, 10, (128,)).cuda()

            optimizer.zero_grad()
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()

            if epoch % 10 == 0:
                print(f"Medium Training - Epoch {epoch}, Loss: {loss.item():.4f}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(4)

        print("Medium training completed on 2g.20gb MIG instance")
  small-inference-script.py: |
    import torch
    import torch.nn as nn
    import time
    import os

    print(f"=== SMALL INFERENCE POD (1g.10gb) ===")
    print(f"Process ID: {os.getpid()}")
    print(f"GPU available: {torch.cuda.is_available()}")
    print(f"GPU count: {torch.cuda.device_count()}")

    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"Using GPU: {torch.cuda.get_device_name(device)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(device).total_memory / 1e9:.1f} GB")

        # Small model for 1g.10gb instance
        model = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        ).cuda()

        print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

        # Inference loop
        for i in range(200):
            with torch.no_grad():
                # Small batch for 1g.10gb
                x = torch.randn(32, 512).cuda()
                output = model(x)
                prediction = torch.argmax(output, dim=1)

            if i % 20 == 0:
                print(f"Small Inference - Batch {i}, Predictions: {prediction[:5].tolist()}, GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f}GB")
            time.sleep(2)

        print("Small inference completed on 1g.10gb MIG instance")
---
# Pod 1: Large training workload (3g.40gb)
apiVersion: v1
kind: Pod
metadata:
  name: mig-large-training-pod
  namespace: mig-gpu
  labels:
    app: mig-large-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: large-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/large-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-large-claim
  resourceClaims:
  - name: mig-large-claim
    resourceClaimTemplateName: mig-large-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
---
# Pod 2: Medium training workload (2g.20gb) - can run on SAME GPU as Pod 1
apiVersion: v1
kind: Pod
metadata:
  name: mig-medium-training-pod
  namespace: mig-gpu
  labels:
    app: mig-medium-training
    workload-type: training
spec:
  restartPolicy: Never
  containers:
  - name: medium-training-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/medium-training-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-medium-claim
  resourceClaims:
  - name: mig-medium-claim
    resourceClaimTemplateName: mig-medium-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
---
# Pod 3: Small inference workload (1g.10gb) - can run on SAME GPU as Pod 1 & 2
apiVersion: v1
kind: Pod
metadata:
  name: mig-small-inference-pod
  namespace: mig-gpu
  labels:
    app: mig-small-inference
    workload-type: inference
spec:
  restartPolicy: Never
  containers:
  - name: small-inference-container
    image: nvcr.io/nvidia/pytorch:25.04-py3
    command: ["python", "/scripts/small-inference-script.py"]
    volumeMounts:
    - name: script-volume
      mountPath: /scripts
      readOnly: true
    resources:
      claims:
      - name: mig-small-claim
  resourceClaims:
  - name: mig-small-claim
    resourceClaimTemplateName: mig-small-template
  nodeSelector:
    node.kubernetes.io/instance-type: p4de.24xlarge
    nvidia.com/gpu.present: "true"
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  volumes:
  - name: script-volume
    configMap:
      name: mig-scripts-configmap
      defaultMode: 0755
Deploy the Example:
kubectl apply -f mig-claim-template.yaml
kubectl apply -f mig-pod.yaml
kubectl get pods -n mig-gpu -w
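After the pods schedule, you can confirm that each one landed on its own MIG instance rather than a full GPU:

```bash
# Each claim should be allocated against a different MIG profile
kubectl get resourceclaims -n mig-gpu

# Inside a pod, only the assigned MIG device is visible (listed by nvidia-smi -L as a MIG 3g.40gb device)
kubectl exec -n mig-gpu mig-large-training-pod -- nvidia-smi -L
```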
Best For:
- Multi-tenant environments requiring strict isolation
- Predictable performance requirements
- Production workloads requiring guaranteed resources
- Compliance scenarios requiring hardware-level isolation
- Hardware-level partitioning creates isolated GPU instances
- Each MIG instance has dedicated compute units and memory
- Complete fault isolation between instances
- Requires GPU Operator with MIG Manager for dynamic reconfiguration
Strategy Selection Guide
| Workload Type | Recommended Strategy | Key Benefit |
| --- | --- | --- |
| Small Inference Jobs | Time-slicing or MPS | Higher GPU utilization |
| Concurrent Small Models | MPS | True parallelism |
| Production Multi-tenant | MIG | Hardware isolation |
| Large Model Training | Basic Allocation | Maximum performance |
| Development/Testing | Time-slicing | Flexibility and simplicity |
Cleanup
Removing DRA Components
- 1. Clean Up DRA Examples
- 2. JARK Stack Cleanup
Remove all DRA example workloads:
# Delete all pods first to ensure proper cleanup
kubectl delete pod inference-pod-1 -n timeslicing-gpu --ignore-not-found
kubectl delete pod training-pod-2 -n timeslicing-gpu --ignore-not-found
kubectl delete pod mps-multi-container-pod -n mps-gpu --ignore-not-found
kubectl delete pod mig-large-training-pod mig-medium-training-pod mig-small-inference-pod -n mig-gpu --ignore-not-found
kubectl delete pod gpu-pod -n gpu-test1 --ignore-not-found
# Delete ResourceClaimTemplates
kubectl delete resourceclaimtemplate timeslicing-gpu-template -n timeslicing-gpu --ignore-not-found
kubectl delete resourceclaimtemplate mps-gpu-template -n mps-gpu --ignore-not-found
kubectl delete resourceclaimtemplate mig-large-template mig-medium-template mig-small-template -n mig-gpu --ignore-not-found
kubectl delete resourceclaimtemplate single-gpu -n gpu-test1 --ignore-not-found
# Delete any remaining ResourceClaims
kubectl delete resourceclaims --all --all-namespaces --ignore-not-found
# Delete ConfigMaps (contain scripts)
kubectl delete configmap timeslicing-scripts-configmap -n timeslicing-gpu --ignore-not-found
kubectl delete configmap mps-scripts-configmap -n mps-gpu --ignore-not-found
kubectl delete configmap mig-scripts-configmap -n mig-gpu --ignore-not-found
# Finally delete namespaces
kubectl delete namespace timeslicing-gpu --ignore-not-found
kubectl delete namespace mps-gpu --ignore-not-found
kubectl delete namespace mig-gpu --ignore-not-found
kubectl delete namespace gpu-test1 --ignore-not-found
# Verify cleanup
kubectl get resourceclaims --all-namespaces
kubectl get resourceclaimtemplates --all-namespaces
For JARK-deployed clusters, use the automated cleanup:
# Navigate to JARK directory
cd ai-on-eks/infra/jark-stack/terraform/_LOCAL
# Run the cleanup script
chmod +x cleanup.sh
./cleanup.sh
# Alternative: Manual terraform destroy
# terraform destroy -var-file=terraform/blueprint.tfvars -auto-approve
This will remove the entire EKS cluster and all associated resources. Ensure you have backed up any important data before proceeding.
Troubleshooting Common Issues
- Pods Stuck in Pending
- GPU Sharing Conflicts
- Performance Issues
Issue: Pods with ResourceClaims stuck in Pending state
Diagnosis:
# Check ResourceClaim status
kubectl get resourceclaims --all-namespaces -o wide
# Check DRA driver logs
kubectl logs -n gpu-operator -l app=nvidia-dra-driver --tail=100
# Verify DeviceClasses exist
kubectl get deviceclasses
Resolution:
# Restart DRA driver pods
kubectl delete pods -n gpu-operator -l app=nvidia-dra-driver
# Check node GPU availability
kubectl describe nodes | grep -A 10 "Allocatable"
Issue: Incompatible sharing strategies on same GPU
Diagnosis:
# Check ResourceSlice allocation
kubectl get resourceslices -o yaml
# Verify current allocations
kubectl get resourceclaims --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.status.allocation.deviceResults[*].device}{"\n"}{end}'
Resolution:
# Remove conflicting ResourceClaims
kubectl delete resourceclaim <conflicting-claim> -n <namespace>
# Wait for resource cleanup
kubectl wait --for=delete resourceclaim <claim-name> -n <namespace> --timeout=60s
Issue: Suboptimal GPU utilization or performance
Monitoring:
# Check GPU utilization
kubectl exec -it <gpu-pod> -n <namespace> -- nvidia-smi
# Monitor ResourceClaim allocation
kubectl get events --field-selector reason=ResourceClaimAllocated --sort-by='.lastTimestamp'
# Check sharing strategy effectiveness
kubectl logs <workload-pod> -n <namespace> | grep -i gpu
Optimization:
- Review sharing strategy selection (MPS vs time-slicing vs exclusive)
- Validate workload resource requirements match allocation
- Consider MIG partitioning for predictable isolation
Conclusion
Dynamic Resource Allocation represents a fundamental shift from rigid GPU allocation to intelligent, workload-aware resource management. By leveraging structured ResourceClaims and vendor-specific drivers, DRA unlocks the GPU utilization rates necessary for cost-effective AI/ML operations at enterprise scale.
With the simplified JARK-based deployment approach, organizations can implement production-grade DRA capabilities in three steps, transforming their GPU infrastructure from a static resource pool into a dynamic, intelligent platform optimized for modern AI workloads.
The combination of EKS's managed infrastructure, NVIDIA's driver ecosystem, and Kubernetes' declarative model creates a powerful foundation for next-generation AI workloads - from small inference jobs to multi-node distributed training on GB200 superchips.