
SageMaker HyperPod Resiliency Overview

SageMaker HyperPod is built for resilient training with comprehensive health monitoring and automatic recovery capabilities. This section provides an overview of the resiliency features that apply to both HyperPod EKS and HyperPod Slurm orchestrators.

Health Monitoring Agent

The SageMaker HyperPod health-monitoring agent (HMA) continuously monitors the health of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.

The SageMaker HyperPod HMA performs the same health checks for both EKS and Slurm orchestrators, providing consistent monitoring across different orchestration platforms.

Health Checks Performed by HMA

The SageMaker HyperPod health-monitoring agent performs comprehensive health checks across different hardware components:

NVIDIA GPUs

  • DCGM policy violation notifications: Monitors all GPU-related policies from NVIDIA DCGM
  • NVIDIA SMI errors: Parses output from nvidia-smi to determine GPU health
  • XID errors: Monitors kernel logs for XID messages indicating hardware malfunctions
  • GPU Count validation: Verifies expected GPU count matches actual count (e.g., 8 GPUs in ml.p5.48xlarge). Reboots node if mismatch detected
  • Various EC2 platform log errors: Monitors Amazon EC2 generated logs for issues
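
The agent performs these probes automatically, but you can approximate several of them from a shell on the node. A minimal sketch, assuming the NVIDIA driver utilities are installed and you have root access for kernel logs:

# Query per-GPU health indicators, including uncorrected ECC error counts
nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.total --format=csv

# Scan kernel logs for XID messages that indicate hardware malfunctions
sudo dmesg | grep -i xid

# Verify the visible GPU count matches the instance type (e.g., 8 for ml.p5.48xlarge)
nvidia-smi --list-gpus | wc -l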

AWS Trainium

  • Neuron monitor errors: Checks output from AWS Neuron monitor for issues
  • Neuron node problem detector: Uses outputs from the Neuron node problem detector for comprehensive health assessment
  • Neuron Device Count validation: Verifies actual neuron device count matches expected count for instance type. Reboots node if mismatch detected
  • EC2 platform log monitoring: Monitors Amazon EC2 generated logs for Trainium-specific issues
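
You can spot-check Trainium health in the same spirit with the AWS Neuron tools. A sketch, assuming the Neuron SDK tools are installed on the node:

# List visible Neuron devices; compare the count against the expected count for the instance type
neuron-ls

# Stream Neuron hardware and runtime metrics as JSON (press Ctrl+C to stop)
neuron-monitor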

Health Monitoring Agent Logs

The SageMaker HyperPod health-monitoring agent runs continuously on all HyperPod clusters and publishes detected health events to CloudWatch under the cluster log group /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>.

Detection logs are created as separate log streams named SagemakerHealthMonitoringAgent for each node. You can query these logs using CloudWatch Logs Insights:

fields @timestamp, @message 
| filter @message like /HealthMonitoringAgentDetectionEvent/

Example output:

{
  "level": "info",
  "ts": "2024-08-21T18:35:35Z",
  "msg": "NPD caught event: %v",
  "details": {
    "severity": "warn",
    "timestamp": "2024-08-22T20:59:29Z",
    "reason": "XidHardwareFailure",
    "message": "Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6\""
  },
  "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
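
The same query can be run from the AWS CLI. A sketch, assuming a Linux shell; substitute your cluster name and ID:

# Start the query against the cluster log group
aws logs start-query \
  --log-group-name "/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>" \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/'

# Retrieve the results using the queryId returned by start-query
aws logs get-query-results --query-id <query_id>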

Basic Health Checks

SageMaker HyperPod performs orchestrator-agnostic basic health checks during cluster creation and updates. These checks include:

  • DCGM policies (NVIDIA GPUs): Continuous monitoring of GPU-related policies from NVIDIA DCGM
  • NVIDIA SMI (NVIDIA GPUs): Parsing nvidia-smi output to determine GPU health
  • XID (NVIDIA GPUs): Monitoring kernel logs for XID messages indicating hardware malfunctions
  • Neuron sysfs (Trainium/Inferentia): Reading counters from Neuron sysfs propagated by the Neuron driver
  • EFA (all instance types): Connectivity tests using all available EFA cards within the instance
  • DCGM Diagnostic (NVIDIA GPUs): DCGM diagnostics level 2 to exercise GPUs under pressure
  • CPU stress (all instance types): Linux stress tool running multiple threads for 100% CPU utilization and I/O operations
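
The DCGM Diagnostic check corresponds to run level 2 of NVIDIA's dcgmi utility. A sketch of an equivalent manual run, assuming DCGM is installed on the node:

# Run DCGM diagnostics at level 2 (exercises GPUs under load)
dcgmi diag -r 2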

Deep Health Checks

SageMaker HyperPod performs deep health checks during cluster creation and updates to ensure reliability and stability by thoroughly testing underlying hardware and infrastructure components.

Instance-Level Deep Health Checks

  • GPU/NVLink count (Accelerator; GPU instances): Verifies GPU and NVLink counts
  • DCGM diagnostics level 4 (Accelerator; GPU instances): Assesses GPU health with DCGM diagnostics, including memory tests
  • Neuron sysfs (Accelerator; Trainium instances): Determines Neuron device health by reading counters from Neuron sysfs
  • Neuron hardware check (Accelerator; Trainium instances): Runs a training workload to test hardware functionality
  • NCCOM local test (Accelerator; Trainium instances): Evaluates collective communication performance on single Trainium nodes
  • EFA (Network; GPU and Trainium instances): Runs latency and bandwidth benchmarking on attached EFA devices
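
Before the EFA benchmark runs, you can confirm yourself that the EFA devices are visible to libfabric. A sketch; fi_info ships with the EFA installer:

# List EFA-capable libfabric providers and their domains
fi_info -p efa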

Cluster-Level Deep Health Checks

  • NCCL test (Accelerator; GPU instances): Verifies collective communication performance across multiple NVIDIA GPUs
  • NCCOM cluster test (Accelerator; Trainium instances): Verifies collective communication performance across multiple Trainium nodes
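
The NCCL test is conceptually similar to the all_reduce_perf benchmark from the open-source nccl-tests suite. A sketch of a comparable single-node run, assuming nccl-tests is built and on the PATH (the managed check runs across multiple nodes):

# All-reduce bandwidth benchmark from 8 bytes to 128 MB across 8 local GPUs
all_reduce_perf -b 8 -e 128M -f 2 -g 8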

Deep Health Check Logs

Cluster-Level Logs

Cluster-level results are stored in the CloudWatch log group /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>, in log streams named DeepHealthCheckResults/<log_stream_id>.
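
To locate these streams from the AWS CLI, a sketch (substitute your cluster name and ID):

# List deep health check result streams in the cluster log group
aws logs describe-log-streams \
  --log-group-name "/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>" \
  --log-stream-name-prefix "DeepHealthCheckResults"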

Example failure log:

{
  "level": "error",
  "ts": "2024-06-18T21:15:22Z",
  "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-west-2, InstanceType: p4d.24xlarge. ERROR:Bandwidth has less than threshold: Expected minimum threshold :80,NCCL Test output Bw: 30"
}

Instance-Level Logs

Stored locally at: /var/log/aws/clusters/sagemaker-deep-health-check.log

Access via SSH:

cat /var/log/aws/clusters/sagemaker-deep-health-check.log

Example outputs:

Hardware Stress Test:

2024-08-20T21:53:58Z info Executing Hardware stress check with command: stress-ng, and args: [--cpu 32 --vm 2 --hdd 1 --fork 8 --switch 4 --timeout 60 --metrics]
2024-08-20T21:54:58Z info stress-ng success
2024-08-20T21:54:58Z info GpuPci Count check success

DCGM Stress Test:

2024-08-20T22:25:02Z info DCGM diagnostic health summary: dcgmCheckLevel: 0 dcgmVersion: 3.3.7 gpuDriverVersion: 535.183.01, gpuDeviceIds: [2237] replacementRequired: false rebootRequired:false

EFA Loopback Test:

2024-08-20T22:26:28Z info EFA Loopback check passed for device: rdmap0s29 . Output summary is MaxBw: 58.590000, AvgBw: 32.420000, MaxTypicalLat: 30.870000, MinTypicalLat: 20.080000, AvgLat: 21.630000

Automatic Node Recovery

During cluster creation or update, administrators can select node recovery options:

  • Automatic (Recommended): SageMaker HyperPod automatically reboots or replaces faulty nodes
  • None: The health-monitoring agent labels instances when faults are detected but does not initiate automatic recovery actions (not recommended)
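
This choice maps to the NodeRecovery field of the SageMaker cluster APIs. A sketch of setting it with the AWS CLI, assuming a hypothetical cluster named my-cluster and that your CLI version exposes the field as --node-recovery; note that UpdateCluster also requires the instance group configuration:

# instance-groups.json holds the cluster's existing instance group configuration
aws sagemaker update-cluster \
  --cluster-name my-cluster \
  --instance-groups file://instance-groups.json \
  --node-recovery Automatic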

Automatic node recovery is triggered by:

  • Health-monitoring agent detections
  • Basic health check failures
  • Deep health check failures

Kubernetes Labels for Resiliency (EKS Only)

SageMaker HyperPod uses Kubernetes labels to track node health status and deep health check progress.

Node Health Status Labels

The sagemaker.amazonaws.com/node-health-status label can take the following values:

  • Schedulable: Node passed basic health checks and is available for workloads
  • Unschedulable: Node is running deep health checks and is unavailable for workloads
  • UnschedulablePendingReplacement: Node failed checks and requires replacement
  • UnschedulablePendingReboot: Node failed checks and requires reboot

Deep Health Check Labels

The sagemaker.amazonaws.com/deep-health-check-status label can take the following values:

  • InProgress: Node is running deep health checks
  • Passed: Node successfully completed all health checks
  • Failed: Node failed health checks and requires recovery
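
On HyperPod EKS clusters, these labels can be inspected with kubectl. A sketch:

# Show every node with its health status and deep-health-check status as columns
kubectl get nodes -L sagemaker.amazonaws.com/node-health-status -L sagemaker.amazonaws.com/deep-health-check-status

# List only the nodes that are available for workloads
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable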

Fault Type and Reason Labels

  • fault-type labels: Represent high-level fault categories when health checks fail
  • fault-reason labels: Represent detailed fault reasons associated with a fault-type

Next Steps