Resiliency

Reference Documentation

For production resiliency configuration and best practices, see Slurm Resiliency.

Overview

SageMaker HyperPod is built for resilient training: it continuously monitors the cluster using the following health checks:

Health Check	Instance Type	Description
DCGM policies	NVIDIA GPUs	Continuously monitors all GPU-related policies from NVIDIA DCGM
NVIDIA SMI	NVIDIA GPUs	nvidia-smi utility to manage and monitor GPUs
XID	NVIDIA GPUs	Monitors kernel logs for Xid error messages
Neuron sysfs	Trainium/Inferentia	Health of Neuron devices via Neuron sysfs
EFA	All	Diagnostics for Elastic Fabric Adapter (EFA) devices
DCGM Diagnostic	NVIDIA GPUs	DCGM diagnostics level 2 stress testing
CPU stress	All	Linux stress tool for CPU health
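
To make the XID check in the table above more concrete, here is a minimal Python sketch of scanning kernel log lines for Xid messages. It is an illustration only, not HyperPod's actual health-check implementation. It assumes the common NVIDIA driver log form `NVRM: Xid (<PCI device>): <code>, ...`; the names `XID_PATTERN`, `XidEvent`, and `find_xid_events` are hypothetical and introduced just for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed kernel-log form written by the NVIDIA driver, for example:
#   NVRM: Xid (PCI:0000:10:1c.0): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


class XidEvent(NamedTuple):
    """One Xid message found in the kernel log (hypothetical helper type)."""
    device: str  # PCI address reported by the driver
    code: int    # numeric Xid code
    line: str    # full log line, kept for context


def find_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield an XidEvent for every log line that contains an Xid message."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield XidEvent(
                device=match.group("device"),
                code=int(match.group("code")),
                line=line.rstrip("\n"),
            )


if __name__ == "__main__":
    # Sample lines made up for illustration; they are not real cluster output.
    sample_log = [
        "[  100.000000] NVRM: Xid (PCI:0000:10:1c.0): 79, pid=1234, GPU has fallen off the bus.",
        "[  101.000000] eth0: link up",
    ]
    for event in find_xid_events(sample_log):
        print(f"Xid {event.code} on {event.device}")
```

On a real node, the input lines would typically come from the kernel log, for example the output of `dmesg` or `journalctl -k`.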

In this example we'll:

1. Submit a training job using Picotron with checkpointing enabled
2. Inject an Xid error
3. Observe the cluster to confirm that the job recovers from the last checkpoint file

Prerequisites

Before proceeding, make sure you've completed the Picotron Training section.