HyperPod Training Operator Overview
The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs.
For more information, see Using the HyperPod training operator in the Amazon SageMaker documentation.
Key Features
Intelligent Fault Recovery
Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly. The operator can restart individual processes or containers without affecting the entire distributed training job.
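As a rough sketch, this recovery behavior is typically bounded by restart limits declared on the job itself, for example a cap on in-place restarts before the operator escalates to a full job restart. The fragment below is illustrative only; the field names are assumptions, not the documented schema, so confirm them against the CRD installed with your operator version.

```yaml
# Hypothetical recovery-policy fragment on a training job spec.
# Field names are illustrative assumptions, not the documented schema.
runPolicy:
  jobMaxRetryCount: 50     # process/container restart attempts before escalating to a full job restart (illustrative)
  cleanPodPolicy: "None"   # keep healthy pods running during a process-level recovery (illustrative)
```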
Hang Job Detection
The operator automatically monitors training logs for issues such as job hangs, loss spikes, and throughput degradation through configurable log monitoring rules. You can define recovery policies in simple YAML configurations, with no code changes, so you can quickly detect stalled or degraded training states and recover from them.
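For example, a hang-detection rule is commonly expressed as a log pattern that must appear within a start-up window and then recur at a minimum frequency. The snippet below is a minimal sketch under assumed field names (logMonitoringConfiguration, logPattern, and the timing fields); treat them as placeholders and consult the operator's CRD schema for the exact spelling.

```yaml
# Hypothetical log-monitoring rules attached to a training job spec.
# Field names are illustrative assumptions, not the documented schema.
logMonitoringConfiguration:
  - name: JobStart
    logPattern: ".*Loss:.*"                    # a training-step log line must appear...
    expectedStartCutOffInSeconds: 120          # ...within 2 minutes of job start, else the job is treated as hung
  - name: JobHangingDetection
    logPattern: ".*Loss:.*"
    expectedRecurringFrequencyInSeconds: 300   # and must recur at least every 5 minutes thereafter
```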
Process-Level Management
The HyperPod training operator works at the process level, providing fine-grained control over distributed training workloads. This enables more efficient resource utilization and faster recovery from failures.
Integration with HyperPod Ecosystem
The operator integrates seamlessly with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution. These monitoring and recovery capabilities work together to maintain optimal training performance while minimizing operational overhead.
Architecture Components
HyperPod Elastic Agent
The HyperPod elastic agent is an extension of PyTorch's ElasticAgent that orchestrates lifecycles of training workers on each container and communicates with the HyperPod training operator. It must be installed in your training image before submitting jobs.
Training Operator Controller
The controller manages the lifecycle of distributed training jobs, handles fault detection and recovery, and coordinates with Kubernetes to manage pod resources.
Log Monitoring System
Advanced log monitoring capabilities that can detect various training issues:
- Job hanging detection
- Training loss spikes
- Low throughput detection
- Checkpoint upload failures
Supported Versions
The HyperPod training operator works only with specific versions of components:
- Kubernetes versions: 1.28, 1.29, 1.30, 1.31, or 1.32
- Suggested Kueue versions: v0.12.2 and v0.12.3
- HyperPod AMI: Latest release (use UpdateClusterSoftware API to upgrade)
- PyTorch: 2.4.0 – 2.7.1
Optional Integrations
Kueue Integration
While Kueue is not required for the training operator, your cluster administrator can install and configure it for enhanced job scheduling capabilities. The operator supports external framework integration with Kueue for resource allocation and job queuing.
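In practice, opting a job into Kueue scheduling is done by labeling it with the target local queue. The kueue.x-k8s.io/queue-name label below is standard Kueue; the surrounding job kind, API group, and queue name are illustrative assumptions rather than values taken from this page.

```yaml
# Sketch: route a training job through a Kueue LocalQueue.
apiVersion: sagemaker.amazonaws.com/v1   # assumed API group/version for the operator's job kind
kind: HyperPodPyTorchJob                 # assumed CRD name
metadata:
  name: llama-pretrain
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue   # standard Kueue label; queue name is hypothetical
# spec omitted for brevity
```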
Task Governance Integration
The training operator is integrated with HyperPod task governance, a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects. Task governance requires version v1.3.0-eksbuild.1 or higher.
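Task governance builds on the same label-based mechanism: alongside the queue label, a workload priority can be requested with Kueue's standard kueue.x-k8s.io/priority-class label. The namespace, queue name, and priority-class name below are placeholders for illustration, not values taken from this page.

```yaml
# Sketch: job metadata when submitting under task governance (team namespace + priority).
metadata:
  name: llama-pretrain
  namespace: hyperpod-ns-team-a                                 # hypothetical team namespace
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-a-localqueue    # hypothetical queue name
    kueue.x-k8s.io/priority-class: high-priority                # standard Kueue label; class name is hypothetical
```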
Monitoring and Observability
The operator provides comprehensive metrics that can be scraped by Amazon Managed Service for Prometheus:
- hyperpod_training_operator_jobs_created_total: Total number of jobs created
- hyperpod_training_operator_jobs_restart_latency: Current job restart latency
- hyperpod_training_operator_jobs_fault_detection_latency: Fault detection latency
- hyperpod_training_operator_jobs_deleted_total: Total number of deleted jobs
- hyperpod_training_operator_jobs_successful_total: Total number of completed jobs
- hyperpod_training_operator_jobs_failed_total: Total number of failed jobs
- hyperpod_training_operator_jobs_restarted_total: Total number of auto-restarted jobs
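A standard Prometheus scrape job can collect these counters and latencies, for example on a collector that remote-writes to Amazon Managed Service for Prometheus. The sketch below assumes the operator pods run in a namespace named hyperpod-training-operator and carry a matching app.kubernetes.io/name label; both are assumptions to verify against your installation.

```yaml
# prometheus.yml fragment: scrape operator pods discovered via the Kubernetes API.
scrape_configs:
  - job_name: hyperpod-training-operator
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - hyperpod-training-operator        # assumed namespace; adjust to your install
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: hyperpod-training-operator       # assumed pod label value
        action: keep
```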
Prerequisites
Before using the HyperPod training operator, ensure you have:
- HyperPod Cluster: Created a HyperPod cluster with Amazon EKS orchestration
- Latest AMI: Installed the latest AMI on your HyperPod cluster
- Cert-Manager: Installed cert-manager in your cluster
- EKS Pod Identity Agent: Set up using the console or AWS CLI (an eksctl-based sketch follows this list)
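If you manage your EKS cluster with eksctl (one possible route; the console and AWS CLI work equally well), the Pod Identity Agent prerequisite can be declared as an EKS add-on in the cluster config and applied with eksctl create addon. The cluster name and region below are placeholders; cert-manager is typically installed separately via its published manifests or Helm chart.

```yaml
# eksctl ClusterConfig fragment: install the EKS Pod Identity Agent add-on.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-hyperpod-eks-cluster   # hypothetical cluster name
  region: us-west-2               # hypothetical region
addons:
  - name: eks-pod-identity-agent
```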
Next Steps
To get started with the HyperPod training operator:
- Review the Installation and Usage Guide
- Install the operator through the SageMaker AI console, Amazon EKS console, or AWS CLI
- Configure your training images with the HyperPod elastic agent
- Submit your first distributed training job (a minimal manifest sketch follows this list)
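To make the last two steps concrete, the sketch below outlines what a first job manifest might look like once your training image includes the HyperPod elastic agent, applied to the cluster with kubectl. The kind, API version, field names, image URI, and the hyperpodrun launcher invocation are all illustrative assumptions; the authoritative schema and launcher usage are in the Installation and Usage Guide.

```yaml
# Hypothetical minimal training job; every name below is a placeholder.
apiVersion: sagemaker.amazonaws.com/v1       # assumed API group/version
kind: HyperPodPyTorchJob                     # assumed CRD name
metadata:
  name: my-first-hyperpod-job
spec:
  nprocPerNode: "8"                          # worker processes per pod (illustrative field name)
  replicaSpecs:                              # illustrative structure
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: trainer
              image: <account>.dkr.ecr.<region>.amazonaws.com/my-training-image:latest  # image with the elastic agent installed
              command: ["hyperpodrun"]       # assumed elastic-agent launcher
              args: ["--nproc-per-node=8", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
```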
The HyperPod training operator represents a significant advancement in managing distributed training workloads, providing the reliability and efficiency needed for large-scale generative AI model development.