Resiliency

Reference Documentation

For production resiliency configuration and best practices, see Slurm Resiliency.

Overview

SageMaker HyperPod is built for resilient training. It continuously monitors cluster health using the following checks:

| Health Check    | Instance Type       | Description                                                       |
|-----------------|---------------------|-------------------------------------------------------------------|
| DCGM policies   | NVIDIA GPUs         | Continuously monitors all GPU-related policies from NVIDIA DCGM   |
| NVIDIA SMI      | NVIDIA GPUs         | Uses the nvidia-smi utility to manage and monitor GPUs            |
| XID             | NVIDIA GPUs         | Monitors kernel logs for XID messages                             |
| Neuron sysfs    | Trainium/Inferentia | Checks the health of Neuron devices via Neuron sysfs              |
| EFA             | All                 | Runs diagnostics on Elastic Fabric Adapter (EFA) devices          |
| DCGM Diagnostic | NVIDIA GPUs         | Runs DCGM diagnostics level 2 stress testing                      |
| CPU stress      | All                 | Uses the Linux stress tool to check CPU health                    |
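
To make the XID check more concrete, here is a minimal Python sketch of scanning kernel log lines for XID messages. It is an illustration only, not how HyperPod implements the check. It assumes the common NVIDIA driver log form `NVRM: Xid (<device>): <code>, <detail>`; the function name `find_xid_events` and the sample log lines are hypothetical.

```python
import re

# Assumed shape of an NVIDIA driver XID line in the kernel log, for example:
#   NVRM: Xid (PCI:0000:10:1c.0): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xid_events(log_lines):
    """Return a (device, xid_code, detail) tuple for each line reporting an XID."""
    events = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append(
                (
                    match.group("device"),
                    int(match.group("code")),
                    match.group("detail") or "",
                )
            )
    return events


if __name__ == "__main__":
    # Hypothetical sample lines; on a real node these would come from the kernel log.
    sample = [
        "[ 10.0] NVRM: Xid (PCI:0000:10:1c.0): 79, pid=1234, GPU has fallen off the bus.",
        "[ 11.0] usb 1-1: new high-speed USB device",
    ]
    for device, code, detail in find_xid_events(sample):
        print(f"XID {code} on {device}: {detail}")
```

On a real node, the input lines could come from `dmesg` or the system journal. Which XID codes should trigger a node replacement is a policy decision that this sketch does not cover.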

Test Case

In this example we'll:

  1. Submit a training job using Picotron with checkpointing enabled
  2. Inject an XID error
  3. Observe the cluster to confirm that the job resumes from the last checkpoint file
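
Recovery in step 3 depends on the training job locating its most recent checkpoint when it restarts. The following Python sketch shows one way to do that. It assumes a hypothetical naming scheme of one file per checkpoint, named `step_<N>.pt`. The actual layout Picotron writes may differ, and `latest_checkpoint` is an illustrative helper, not part of Picotron or HyperPod.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: one file per checkpoint, named "step_<N>.pt".
CHECKPOINT_NAME = re.compile(r"step_(\d+)\.pt")


def latest_checkpoint(checkpoint_dir):
    """Return the Path of the highest-step checkpoint, or None if none exists."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = CHECKPOINT_NAME.fullmatch(path.name)
        # Compare step numbers numerically so step_100 ranks above step_20.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

A restarted job would call `latest_checkpoint(...)` at startup, load the returned file if there is one, and start from step 0 otherwise.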

Prerequisites

Before proceeding, make sure you've completed the Picotron Training section.