Resiliency
Reference Documentation
For production resiliency configuration and best practices, see Slurm Resiliency.
Overview
SageMaker HyperPod is built for resilient training. It continuously monitors the cluster using the following health checks:
| Health Check | Instance Type | Description |
|---|---|---|
| DCGM policies | NVIDIA GPUs | Continuously monitors all GPU-related policies from NVIDIA DCGM |
| NVIDIA SMI | NVIDIA GPUs | Runs the `nvidia-smi` utility to check GPU status |
| XID | NVIDIA GPUs | Monitors kernel logs for NVIDIA Xid error messages |
| Neuron sysfs | Trainium/Inferentia | Health of Neuron devices via Neuron sysfs |
| EFA | All | Diagnostics for Elastic Fabric Adapter (EFA) devices |
| DCGM Diagnostic | NVIDIA GPUs | DCGM diagnostics level 2 stress testing |
| CPU stress | All | Linux stress tool for CPU health |
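
To make the XID check more concrete, here is a minimal Python sketch of what scanning kernel log lines for Xid messages could look like. It is an illustration only and not HyperPod's actual implementation; the function name `find_xid_events` and the sample log lines are assumptions made for this example. The regular expression targets the `NVRM: Xid (<device>): <code>` prefix that the NVIDIA driver writes to the kernel log.

```python
import re

# Matches NVIDIA driver kernel log lines such as:
#   NVRM: Xid (PCI:0000:10:1c.0): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def find_xid_events(log_lines):
    """Return a list of (device, xid_code) tuples, one per Xid line found.

    `find_xid_events` is a hypothetical helper for this sketch.
    """
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("device"), int(match.group("code"))))
    return events


if __name__ == "__main__":
    # Sample lines made up for illustration; real input would come from
    # the kernel log (for example, the output of `dmesg`).
    sample = [
        "[ 1203.11] NVRM: Xid (PCI:0000:10:1c.0): 79, pid=1234, GPU has fallen off the bus.",
        "[ 1204.52] usb 1-1: new high-speed USB device number 2",
    ]
    for device, code in find_xid_events(sample):
        print(f"Xid {code} reported on {device}")
```

A real health check would act on specific Xid codes rather than only printing them; which codes trigger node replacement is decided by HyperPod, not by this sketch.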
Test Case
In this example we'll:
- Submit a training job using Picotron with checkpointing enabled
- Inject an Xid error
- Observe the cluster to confirm that the job resumes from the last checkpoint
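
Resuming after a failure depends on locating the most recent checkpoint. The sketch below shows one way such a lookup could work in Python. The `step_<N>` naming scheme and the `latest_checkpoint` helper are hypothetical and chosen only for illustration; Picotron's actual checkpoint layout may differ.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: one entry per saved step, e.g. "step_2000".
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the entry with the highest step number, or None."""
    best_step, best_path = -1, None
    for entry in Path(checkpoint_dir).iterdir():
        match = STEP_PATTERN.match(entry.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), entry
    return best_path
```

Comparing step numbers as integers rather than as strings matters here: a plain lexical sort would rank `step_300` above `step_2000`.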
Prerequisites
Before proceeding, make sure you've completed the Picotron Training section.