GPU Stress Testing and Validation
GPU stress testing validates hardware stability, thermal management, and performance consistency by placing GPUs under sustained computational load. This guide focuses on "burning" GPUs, pushing them to their limits to surface potential hardware issues before production workloads do.
Overview
GPU stress testing provides:
- Hardware stability validation under sustained load
- Thermal throttling detection and mitigation
- Memory error detection and validation
- Power delivery validation under peak loads
- Performance consistency verification across all GPUs
Prerequisites
NVIDIA GPU Testing
- NVIDIA GPUs (P4d, P5, P6e instances)
- NVIDIA drivers and CUDA toolkit
- DCGM (Data Center GPU Manager) - pre-installed on HyperPod
- Administrative access for hardware monitoring
NVIDIA GPU Stress Testing
1. DCGM Diagnostic Tests
DCGM provides comprehensive GPU diagnostics and stress testing capabilities:
# Initialize DCGM and discover GPUs
dcgmi discovery -l
# Create a group for testing (optional)
dcgmi group -c stress_test_group
dcgmi group -g 1 -a 0,1,2,3,4,5,6,7 # Add all GPUs (use the group ID printed by the previous command; 1 is assumed here)
# Run different levels of diagnostics
dcgmi diag -g 1 -r 1 # Level 1 (quick)
dcgmi diag -g 1 -r 2 # Level 2 (medium stress)
dcgmi diag -g 1 -r 3 # Level 3 (intensive)
# Enable health watches, then check GPU health
dcgmi health -g 1 -s a # Watch all subsystems (PCIe, memory, thermal, power, NVLink, ...)
dcgmi health -g 1 -c
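For unattended validation, the diagnostic output can be captured and scanned for failures. A minimal sketch, assuming your DCGM version supports JSON output for dcgmi diag (the exact result field names vary between versions):
# Save the full level-3 results and flag any reported failure
dcgmi diag -g 1 -r 3 -j > diag_results.json
if grep -qi '"fail"' diag_results.json; then
  echo "One or more DCGM diagnostics failed - review diag_results.json"
fi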
2. GPU Burn Tool
GPU Burn is a simple, effective tool for driving GPUs to maximum utilization. Build it from the gpu-burn repository:
# Install GPU Burn
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make
# Run stress test for different durations
./gpu_burn 60 # 1 minute test
./gpu_burn 300 # 5 minute test
./gpu_burn 1800 # 30 minute test
# Run on specific GPUs
CUDA_VISIBLE_DEVICES=0,1 ./gpu_burn 300
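To isolate a suspect device, the same test can be run one GPU at a time and the output kept for comparison. A minimal sketch assuming 8 GPUs and that gpu_burn was built in the current directory; the log file names are illustrative:
# Burn each GPU individually for 5 minutes and keep a per-device log
for i in $(seq 0 7); do
  echo "=== Stressing GPU $i ==="
  CUDA_VISIBLE_DEVICES=$i ./gpu_burn 300 | tee "gpu_burn_gpu${i}.log"
done
# gpu_burn reports OK or FAULTY per GPU at the end of each run
grep -l "FAULTY" gpu_burn_gpu*.log || echo "No faulty GPUs reported"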
Monitoring During Stress Tests
Real-time GPU Monitoring
# Monitor GPU metrics continuously
watch -n 1 nvidia-smi
# Monitor with detailed metrics and logging
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw,memory.used,memory.total,utilization.gpu \
--format=csv -l 1 > gpu_stress_log.csv
# Monitor for thermal throttling
watch -n 1 'nvidia-smi --query-gpu=temperature.gpu,power.draw --format=csv,noheader,nounits | awk -F, "{if(\$1>85) print \"WARNING: GPU temp \"\$1\"°C\"}"'
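The CSV logger above runs until interrupted, so for unattended runs it helps to background it for the duration of the stress test. A minimal sketch, assuming gpu_burn from the previous section is the workload:
# Start the logger in the background, run the stress test, then stop the logger
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw,memory.used,memory.total,utilization.gpu \
    --format=csv -l 1 > gpu_stress_log.csv &
LOGGER_PID=$!
./gpu_burn 300
kill "$LOGGER_PID"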
DCGM Monitoring
# Start DCGM daemon (if not running)
sudo nv-hostengine
# Monitor GPU health continuously
dcgmi health -c
# Monitor detailed profiling metrics (graphics engine activity, SM activity, SM occupancy, tensor pipe, DRAM activity, FP64 pipe)
dcgmi dmon -e 1001,1002,1003,1004,1005,1006 -c 100
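DCGM can also summarize an entire stress run as job statistics (energy, peak temperature, time throttled). A sketch assuming the group created earlier (ID 1); "burn1" is an arbitrary job label and flag spellings may differ slightly across DCGM versions:
# Enable job statistics recording on the group
dcgmi stats -g 1 -e
# Bracket the stress run with start/stop markers
dcgmi stats -g 1 -s burn1
./gpu_burn 300 # Run from the gpu-burn build directory
dcgmi stats -g 1 -x burn1
# Display the aggregated statistics for the run
dcgmi stats -j burn1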
Stress Test Analysis
Temperature Analysis
# Extract temperature statistics from nvidia-smi CSV
awk -F',' 'NR>1 {sum+=$3; if($3>max) max=$3; if(min=="" || $3<min) min=$3} END {print "Temp - Min:"min"°C, Max:"max"°C, Avg:"sum/(NR-1)"°C"}' gpu_stress_log.csv
# Check for thermal throttling events
awk -F',' 'NR>1 && $3>85 {print "WARNING: High temperature "$3"°C at "$1}' gpu_stress_log.csv
Power Analysis
# Extract power statistics
awk -F',' 'NR>1 {sum+=$4; if($4>max) max=$4; if(min=="" || $4<min) min=$4} END {print "Power - Min:"min"W, Max:"max"W, Avg:"sum/(NR-1)"W"}' gpu_stress_log.csv
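Per-GPU Consistency
To compare GPUs against each other rather than in aggregate, log the GPU index alongside the metrics and group the statistics by it. A sketch using a separate per-device log (the column positions below match this modified query, not the earlier one):
# Log per-device samples with an explicit GPU index
nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader,nounits -l 1 > gpu_per_device_log.csv
# Average temperature and power per GPU index
awk -F',' '{t[$1]+=$2; p[$1]+=$3; n[$1]++} END {for (g in n) printf "GPU %s: avg temp %.1f°C, avg power %.1fW\n", g, t[g]/n[g], p[g]/n[g]}' gpu_per_device_log.csv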
Best Practices
1. Pre-Stress Checklist
- ✅ Verify cooling systems are operational
- ✅ Check power delivery capacity
- ✅ Ensure monitoring tools are configured
- ✅ Set up temperature alerts
- ✅ Plan for graceful shutdown procedures
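A few of these checks can be scripted before every run. A minimal pre-flight sketch; the expected GPU count (8) and the 60°C idle-temperature limit are assumptions to adjust for your instance type:
# Confirm all expected GPUs are visible
EXPECTED_GPUS=8
FOUND=$(nvidia-smi -L | wc -l)
[ "$FOUND" -eq "$EXPECTED_GPUS" ] || echo "WARNING: expected $EXPECTED_GPUS GPUs, found $FOUND"
# Verify the DCGM host engine is reachable
dcgmi discovery -l > /dev/null || echo "WARNING: DCGM host engine not responding"
# Make sure no GPU is already hot before the test starts
nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | \
  awk -F',' '$2 > 60 {print "WARNING: GPU "$1" already at "$2"°C before the test"}'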
2. During Stress Testing
- Monitor temperatures continuously (keep below 85°C)
- Watch for power throttling
- Check for memory errors
- Validate performance consistency
- Be ready to stop tests if temperatures spike
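The last point can be automated with a small watchdog that kills the burn if any GPU crosses a hard limit. A sketch assuming gpu_burn is the running stress process; the 90°C abort threshold is an assumption to adapt to your hardware:
# Stop the stress test if any GPU exceeds the abort threshold
THRESHOLD=90
while pgrep -x gpu_burn > /dev/null; do
  MAX_TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits | sort -n | tail -1)
  if [ "$MAX_TEMP" -gt "$THRESHOLD" ]; then
    echo "ABORT: GPU temperature ${MAX_TEMP}°C exceeds ${THRESHOLD}°C - stopping gpu_burn"
    pkill -x gpu_burn
  fi
  sleep 5
done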
3. Post-Stress Analysis
- Review thermal profiles for hot spots
- Check for hardware errors in logs
- Validate all GPUs performed consistently
- Document any issues found
- Plan remediation for problematic hardware
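Hardware errors provoked by the stress run typically surface in the kernel log (as NVIDIA Xid events) and in the ECC and page-retirement counters. A quick sketch for checking both after the test:
# Look for NVIDIA Xid errors logged during the test window
sudo dmesg | grep -i "xid" | tail -20
# Review ECC and page-retirement state after the run
nvidia-smi -q -d ECC,PAGE_RETIREMENT | grep -i -E "errors|retired|pending"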
Troubleshooting
High Temperatures
# Check cooling system status
sensors # If available
nvidia-smi --query-gpu=temperature.gpu --format=csv
# Reduce stress test intensity
./gpu_burn 60 # Shorter duration
Memory Errors
# Check for GPU memory errors (aggregate ECC counters)
nvidia-smi --query-gpu=ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.aggregate.total --format=csv
# Run DCGM memory test
dcgmi diag -r 2 # Includes memory tests
Performance Inconsistencies
# Check GPU clocks
nvidia-smi --query-gpu=clocks.gr,clocks.mem --format=csv
# Monitor for throttling
nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv
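If throttling shows as active, the per-reason flags narrow it down to thermal or power causes. A sketch using the individual throttle-reason query fields (exact field names can vary slightly between driver versions):
# Break the active throttle state down by cause
nvidia-smi --query-gpu=index,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_power_brake_slowdown --format=csv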
Next Steps
After successful GPU stress testing:
- Document baseline thermal and performance characteristics
- Set up continuous monitoring and alerting
- Establish regular stress testing schedules
- Create procedures for handling hardware failures
- Proceed with confidence to production workloads
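As a starting point for a regular schedule, the burn and diagnostic commands above can be wrapped in a script and run from cron. A sketch; /opt/gpu-stress/run_stress.sh and the log paths are hypothetical names, not an existing tool:
#!/bin/bash
# Hypothetical wrapper (/opt/gpu-stress/run_stress.sh) around the commands above
cd /opt/gpu-burn && ./gpu_burn 1800 > "/var/log/gpu_burn_$(date +%F).log" 2>&1
dcgmi diag -r 2 >> "/var/log/gpu_burn_$(date +%F).log" 2>&1
# Crontab entry: run the wrapper every Sunday at 02:00
# 0 2 * * 0 /opt/gpu-stress/run_stress.sh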