PyTorch Environment Validation
This validation script runs a comprehensive PyTorch environment check that screens for NCCL, MPI, OpenMP, CUDA, and other critical components on your HyperPod cluster. The script runs once per instance and helps verify that your environment is properly configured for distributed training.
Overview
The PyTorch environment validation performs the following checks:
- CUDA availability and configuration
- PyTorch backend status (CUDA, cuDNN, MKL, OpenMP)
- Distributed training capabilities (NCCL, MPI)
- GPU driver and library versions
- Container runtime validation
Prerequisites
For Slurm Clusters
- Functional Slurm cluster on AWS
- Docker, Pyxis, and Enroot installed
- Shared directory mounted (typically /fsx or /apps)
- AWS Deep Learning Containers access
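These prerequisites can be spot-checked from the head node before building anything. The commands below are a minimal sketch; the exact shared mount point and the Enroot/Pyxis setup depend on your cluster:
# Confirm the compute nodes are visible to Slurm
sinfo -N -l
# Confirm Enroot is installed and Pyxis has registered its srun flags
enroot version
srun --help | grep container-image
# Confirm the shared filesystem is mounted
df -h /fsx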
For EKS Clusters
- Functional EKS cluster with GPU nodes
- NVIDIA device plugin deployed
- Container registry access (ECR or public registries)
- kubectl configured for cluster access
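These prerequisites can be confirmed directly against the cluster. The commands below are a minimal sketch and assume the standard NVIDIA and EFA device plugins:
# List allocatable GPUs per node (the dot in nvidia.com/gpu must be escaped)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# Confirm the NVIDIA device plugin (and EFA plugin, if used) DaemonSets are running
kubectl get daemonsets -A | grep -Ei 'nvidia|efa'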
Slurm Implementation
1. Get the Validation Scripts
The PyTorch validation scripts are available in the awsome-distributed-training repository.
# Clone the repository
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/4.validation_and_observability/1.pytorch-env-validation
Available files:
- pytorch-screen.py: Main validation script
- 1.torch-screen.sbatch: Slurm job script
- 0.pytorch-screen.Dockerfile: Container build file
Script Features
The pytorch-screen.py script provides comprehensive validation of:
- PyTorch version and configuration
- CUDA availability and device detection
- cuDNN backend settings
- Distributed training capabilities (NCCL, MPI)
- Backend availability (MKL, OpenMP, opt_einsum)
- Environment variable validation
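The full script lives in the repository; the one-liners below are a minimal sketch that exercises the same torch APIs and can be handy for a quick interactive check inside the training container:
# Version and CUDA build information
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# GPU visibility
python -c "import torch; print('cuda:', torch.cuda.is_available(), 'devices:', torch.cuda.device_count())"
# Distributed backends
python -c "import torch.distributed as dist; print('nccl:', dist.is_nccl_available(), 'mpi:', dist.is_mpi_available())"
# CPU-side backends
python -c "import torch; print('cudnn:', torch.backends.cudnn.is_available(), 'mkl:', torch.backends.mkl.is_available())"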
2. Build the Validation Container
Use the provided Dockerfile from the awsome-distributed-training repository:
Dockerfile: 0.pytorch-screen.Dockerfile
Build and convert to Enroot format:
# Get the region
AWS_AZ=$(ec2-metadata --availability-zone | cut -d' ' -f2)
AWS_REGION=${AWS_AZ::-1}
# Authenticate with ECR
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS \
--password-stdin 763104351884.dkr.ecr.${AWS_REGION}.amazonaws.com/pytorch-training
# Build the container using the provided Dockerfile
docker build -t pytorch-validation -f 0.pytorch-screen.Dockerfile \
--build-arg="AWS_REGION=${AWS_REGION}" .
# Convert to Enroot squash file
enroot import -o /fsx/pytorch-validation.sqsh dockerd://pytorch-validation:latest
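Before submitting the full job, it can be worth smoke-testing the imported image on a single node. A sketch, assuming Pyxis provides the --container-image flag on srun:
# Optional: run a one-node sanity check against the squash file
srun -N 1 --container-image /fsx/pytorch-validation.sqsh \
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"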
3. Use the Provided Slurm Job Script
The repository includes a ready-to-use Slurm job script, 1.torch-screen.sbatch.
Key configuration options in the script:
- Node count: Modify #SBATCH -N 2 to change the number of nodes
- Container image: Set the IMAGE variable to your container path
- Shared filesystem: Configure FSX_MOUNT for your setup
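For reference, the lines you would edit typically look like the sketch below; the variable names follow the list above, while the values shown (squash file path, mount mapping) are placeholders for your environment:
#SBATCH -N 2                            # number of nodes to screen
IMAGE=/fsx/pytorch-validation.sqsh      # Enroot squash file built in the previous step
FSX_MOUNT=/fsx:/fsx                     # host_path:container_path mapping for the shared filesystem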
4. Run the Validation
# Submit the job
sbatch 1.torch-screen.sbatch
# Monitor the output
tail -f slurm-<job-id>.out
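Once the job completes, a quick scan of the log flags any check that came back negative. A simple sketch that relies on the key = value output format shown in the Expected Output section below:
# Flag any check that returned False
grep -n "= False" slurm-<job-id>.out || echo "No failed checks found"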
EKS Implementation
1. Create Kubernetes Job Manifest
Since there's no pre-built Kubernetes manifest in the awsome-distributed-training repository for PyTorch validation, you can create a simple Job manifest:
# Create ConfigMap with the validation script
kubectl create configmap pytorch-validation-script \
--from-file=pytorch-screen.py
# Create a basic Job manifest
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: pytorch-validation
namespace: default
spec:
parallelism: 2
completions: 2
template:
spec:
restartPolicy: Never
nodeSelector:
node.kubernetes.io/instance-type: "p5.48xlarge"
containers:
- name: pytorch-validation
image: <YOUR_BUILT_CONTAINER_IMAGE>
command: ["/bin/bash"]
args:
- -c
- |
echo "Node: \$(hostname)"
nvidia-smi
python /workspace/pytorch-screen.py
resources:
limits:
nvidia.com/gpu: 8
vpc.amazonaws.com/efa: 32
requests:
nvidia.com/gpu: 8
vpc.amazonaws.com/efa: 32
volumeMounts:
- name: validation-script
mountPath: /workspace
volumes:
- name: validation-script
configMap:
name: pytorch-validation-script
EOF
2. Monitor and View Results
# Monitor the job
kubectl get jobs -w
# View logs
kubectl logs -l job-name=pytorch-validation
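To see which node each validation pod ran on, and to pull output for a single pod, a couple of follow-up commands (pod names come from kubectl get pods):
# Show the node assignment for each validation pod
kubectl get pods -l job-name=pytorch-validation -o wide
# Inspect one pod in detail
kubectl logs <pod-name>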
Expected Output
The validation script will produce output similar to:
==================================================
PyTorch Environment Validation
==================================================
PyTorch Version: 2.0.1+cu118
Python Version: 3.10.11
==================================================
CUDA Configuration
==================================================
torch.cuda.is_available() = True
torch.version.cuda = 11.8
torch.backends.cuda.is_built() = True
CUDA Device Count: 8
Device 0: NVIDIA H100 80GB HBM3
Device 1: NVIDIA H100 80GB HBM3
...
==================================================
Distributed Training
==================================================
torch.distributed.is_available() = True
torch.distributed.is_mpi_available() = True
torch.distributed.is_nccl_available() = True
==================================================
Validation Complete
==================================================
Environment validation finished successfully!
Troubleshooting
Common Issues
- CUDA not available
  - Verify NVIDIA drivers are installed
  - Check GPU resource allocation in the job spec
  - Ensure the container has GPU access
- NCCL not available
  - Verify NCCL installation in the container
  - Check EFA device plugin deployment (EKS)
  - Validate network configuration
- Container mount issues (Slurm)
  - Verify Enroot/Pyxis installation
  - Check shared filesystem permissions
  - Ensure the squash file is accessible
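The commands below are a minimal sketch of how to gather evidence for each of these cases, run inside the container or on the affected node; fi_info is part of libfabric and is only present on EFA-enabled images:
# GPU visibility from inside the container
nvidia-smi
# NCCL version that PyTorch was built against
python -c "import torch; print(torch.cuda.nccl.version())"
# EFA devices visible to libfabric (EFA-enabled instances only)
fi_info -p efa
# NCCL- and libfabric-related environment variables passed into the job
env | grep -E '^(NCCL|FI)_'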
Validation Checklist
- ✅ CUDA is available and detects all GPUs
- ✅ NCCL is available for distributed training
- ✅ MPI is available for multi-node communication
- ✅ EFA devices are accessible (if using EFA-enabled instances)
- ✅ Container can access shared storage
- ✅ Environment variables are properly set