HyperPod Training Operator Installation and Usage Guide

This guide covers installation of the HyperPod training operator and shows how to run distributed training jobs using examples from the awsome-distributed-training repository.

Prerequisites

Before you use the HyperPod training operator, you must complete the prerequisites, including installing the EKS Pod Identity Agent add-on on your EKS cluster:

aws eks create-addon \
--cluster-name my-eks-cluster \
--addon-name eks-pod-identity-agent \
--region <AWS_REGION>
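
To confirm the add-on finished installing, you can check its status (a quick sanity check; the cluster name matches the example above):

aws eks describe-addon \
--cluster-name my-eks-cluster \
--addon-name eks-pod-identity-agent \
--region <AWS_REGION> \
--query 'addon.status'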

Installation Methods

You can install the HyperPod training operator through three methods: the Amazon SageMaker AI console, the Amazon EKS console, or the AWS CLI.

Amazon SageMaker AI Console

The SageMaker AI console provides a one-click installation that automatically:

  • Creates the IAM execution role
  • Creates the pod identity association
  • Installs the operator

To install from the console:

  1. Open the Amazon SageMaker AI console
  2. Go to your cluster's details page
  3. On the Dashboard tab, locate "Amazon SageMaker HyperPod training operator"
  4. Choose Install

During installation, SageMaker AI creates an IAM execution role with permissions similar to the AmazonSageMakerHyperPodTrainingOperatorAccess managed policy.

Amazon EKS Console

The EKS console installation is similar, but it doesn't automatically create the IAM execution role. You can choose to create a new role during the process; the console pre-populates the required information.

AWS CLI

For programmatic installation with more customization options:

# Set up the EKS Pod Identity Agent (skip if already installed as a prerequisite)
aws eks create-addon \
--cluster-name my-eks-cluster \
--addon-name eks-pod-identity-agent \
--region <AWS_REGION>
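
The remaining CLI steps are to create an IAM execution role, associate it with the operator's service account through EKS Pod Identity, and install the operator itself. The sketch below assumes the operator is distributed as an EKS add-on named amazon-sagemaker-hyperpod-training-operator; the role ARN and service account name are placeholders, so substitute the values for your cluster:

# Associate the execution role with the operator's service account
aws eks create-pod-identity-association \
--cluster-name my-eks-cluster \
--namespace aws-hyperpod \
--service-account <OPERATOR_SERVICE_ACCOUNT> \
--role-arn arn:aws:iam::<ACCOUNT_ID>:role/<EXECUTION_ROLE> \
--region <AWS_REGION>

# Install the training operator add-on (add-on name assumed; check what is
# available with: aws eks describe-addon-versions)
aws eks create-addon \
--cluster-name my-eks-cluster \
--addon-name amazon-sagemaker-hyperpod-training-operator \
--region <AWS_REGION>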

Validate Installation

Once installed, verify the HyperPod controller manager pod is running:

kubectl get pods -n aws-hyperpod

Expected output:

NAME                                                              READY   STATUS    RESTARTS   AGE
health-monitoring-agent-bj57k                                     1/1     Running   0          17d
health-monitoring-agent-plcvm                                     1/1     Running   0          17d
hp-training-operator-hp-training-controller-manager-775bdf47f2s  1/1     Running   0          2d21h
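
You can also confirm that the operator's custom resource definitions are registered (the grep simply filters for HyperPod-related CRDs):

kubectl get crd | grep -i hyperpod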

Running Training Jobs

This example demonstrates how to run a HyperPodPyTorchJob using the same FSDP example from the awsome-distributed-training repository, configured for the HyperPod training operator.

1. Clone the awsome-distributed-training Repository

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/pytorch/FSDP/kubernetes

2. Build and Push Docker Image

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

# Build the container (note: this includes hyperpod-elastic-agent)
pushd ../
docker build -f Dockerfile ${DOCKER_NETWORK} -t ${REGISTRY}fsdp:pytorch2.7.1 .
popd

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"fsdp\" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
aws ecr create-repository --repository-name fsdp
fi

# Login and push
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
docker image push ${REGISTRY}fsdp:pytorch2.7.1

The Dockerfile includes the HyperPod elastic agent installation:

...
RUN pip install hyperpod-elastic-agent
...

3. Job Submission Methods

HyperPodPyTorchJobs can be submitted via kubectl with a YAML manifest or via the HyperPod CLI v3.

3a. Job Submission via kubectl

The llama3_1_8b-fsdp-hpto.yaml file defines a HyperPod PytorchJob with robust error handling:

Key Features:

  • JobStart: fails the job if no "Loss:" line appears in the logs within the first 4 minutes (240 s)
  • JobHangingDetection: fails the job if the gap between consecutive "Loss:" lines exceeds 10 minutes (600 s)
  • Retry policy: 3 process-level restarts before a full job restart, with a maximum of 10 total retries (see the sketch below)
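
Sketched below is roughly how those thresholds map onto the manifest, using the parameter names documented in the configuration table later in this guide. The exact nesting under the job spec may differ from your copy of llama3_1_8b-fsdp-hpto.yaml, so treat this as an illustration rather than the file's literal contents:

# Nesting is illustrative -- follow the layout in the sample manifest
jobMaxRetryCount: 10                          # maximum of 10 total retries
restartPolicy:
  numRestartBeforeFullJobRestart: 3           # 3 process restarts before a full job restart
logMonitoringConfiguration:
  - name: JobStart
    logPattern: '.*Loss:.*'
    expectedStartCutOffInSeconds: 240         # fail if no "Loss:" appears within the first 4 minutes
  - name: JobHangingDetection
    logPattern: '.*Loss:.*'
    expectedRecurringFrequencyInSeconds: 600  # fail if the gap between "Loss:" lines exceeds 10 minutes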

For auto-resume from checkpoint, add FSx for Lustre volumes and modify checkpoint paths:

volumes:
  - name: fsx-storage
    persistentVolumeClaim:
      claimName: fsx-claim

volumeMounts:
  - name: fsx-storage
    mountPath: /fsx

# Update command args:
- '--checkpoint_dir=/fsx/checkpoints'
- '--resume_from_checkpoint=/fsx/checkpoints'

Submit the job:

envsubst < llama3_1_8b-fsdp-hpto.yaml | kubectl apply -f -
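
After submission, confirm that the job resource was created and its pods are starting (the job name comes from the manifest above):

kubectl get hyperpodpytorchjob llama3-1-8b-fsdp
kubectl get pods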

4. Monitor Training Jobs

Install kubetail for log monitoring:

curl -sL https://raw.githubusercontent.com/aws-samples/aws-do-eks/refs/heads/main/Container-Root/eks/ops/setup/install-kubetail.sh | sudo bash

View logs:

kubetail llama3

Describe the HyperPodPytorchJob:

kubectl describe hyperpodpytorchjob llama3-1-8b-fsdp

5. Testing Resiliency

Emulate an instance failure to test the operator's recovery capabilities:

export NODE=$(kubectl get nodes | awk 'NR>1 {print $1}' | shuf -n 1)
kubectl label node $NODE \
sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot \
--overwrite=true
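
After the test, the operator and health agents handle remediation. If you need to clear the label manually (assuming the health agent hasn't already updated it), remove it with kubectl:

kubectl label node $NODE sagemaker.amazonaws.com/node-health-status-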

Check the job status:

kubectl describe hyperpodpytorchjob

Expected output showing fault remediation:

Status:
  Conditions:
    Last Transition Time:  2025-08-04T22:06:18Z
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-08-04T22:08:37Z
    Status:                True
    Type:                  PodsRunning
    Last Transition Time:  2025-08-04T22:08:44Z
    Message:               The fault of reason NodeFault was remediated in 94283 milliseconds.
    Reason:                Running
    Status:                True
    Type:                  Running
  Restart Count:           1
Events:
  Type     Reason     Age   From                             Message
  ----     ------     ----  ----                             -------
  Warning  NodeFault  117s  hyperpod-pytorch-job-controller  Found unhealthy node hyperpod-i-03d315d8cef22bd25
  Normal   Running    23s   hyperpod-pytorch-job-controller  The fault of reason NodeFault was remediated in 94283 milliseconds.

Log Monitoring Configuration Parameters

The following table describes the restart and log monitoring configuration parameters:

| Parameter | Description |
| --- | --- |
| jobMaxRetryCount | Maximum number of restarts at the process level |
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level |
| restartPolicy: evalPeriodSeconds | The period of evaluating the restart limit in seconds |
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails |
| cleanPodPolicy | Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None |
| logMonitoringConfiguration | The log monitoring rules for slow and hanging job detection |
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING |
| expectedStartCutOffInSeconds | Time to first LogPattern match after which the rule evaluates to HANGING |
| logPattern | Regular expression that identifies log lines that the rule applies to when the rule is active |
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before marking a job as SLOW |
| metricThreshold | Threshold for the value extracted by LogPattern with a capturing group |
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq |
| stopPattern | Regular expression to identify the log line at which to deactivate the rule |
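
As an illustration of the metric-based fields, a rule that marks a job SLOW when a throughput value extracted from the logs falls below a threshold might look like the sketch below. The rule name, log format, and threshold are hypothetical, and the direction of the operator comparison is assumed:

logMonitoringConfiguration:
  - name: SlowThroughput
    logPattern: '.*samples/sec: (\d+\.?\d*).*'  # capturing group extracts the throughput value
    metricThreshold: 80                         # hypothetical threshold in samples/sec
    operator: lt                                # assumed: rule evaluates to SLOW when the extracted value is less than the threshold
    metricEvaluationDataPoints: 5               # 5 consecutive SLOW evaluations before the job is marked SLOW
    stopPattern: '.*Training complete.*'        # deactivate the rule once this line appears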

Advanced Configuration Examples

Testing Custom Log Monitoring

To test custom log monitoring configurations, modify your job's logMonitoringConfiguration:

logMonitoringConfiguration:
  - name: JobStart
    logPattern: '.*Loss:.*'
    expectedStartCutOffInSeconds: 1 # Change from 240 to 1 for testing

This will trigger a LogStateHanging_JobStart error if training doesn't start within 1 second, allowing you to test the monitoring system.
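
To observe the resulting failure, describe the job and inspect its conditions and events (the job name matches the earlier example; exactly where the LogStateHanging_JobStart reason surfaces may vary):

kubectl describe hyperpodpytorchjob llama3-1-8b-fsdp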

HyperPod Elastic Agent Arguments

The HyperPod elastic agent supports all PyTorch ElasticAgent arguments plus additional ones:

| Argument | Description | Default |
| --- | --- | --- |
| --shutdown-signal | Signal to send to workers for shutdown | "SIGKILL" |
| --shutdown-timeout | Timeout between SIGTERM and SIGKILL | 30 |
| --server-host | Agent server address | "0.0.0.0" |
| --server-port | Agent server port | 8080 |
| --server-log-level | Agent server log level | "info" |
| --server-shutdown-timeout | Server shutdown timeout | 300 |
| --pre-train-script | Path to pre-training script | None |
| --pre-train-args | Arguments for pre-training script | None |
| --post-train-script | Path to post-training script | None |
| --post-train-args | Arguments for post-training script | None |
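
For reference, here is a hedged example of how these flags might be combined in a container entrypoint. The hyperpodrun launcher name, node and process counts, pre-train script, and training script are assumptions rather than values taken from the sample job:

# Illustrative launch command -- adjust names and paths to your image
hyperpodrun \
  --nnodes=2 --nproc-per-node=8 \
  --shutdown-signal=SIGTERM \
  --shutdown-timeout=60 \
  --server-port=8080 \
  --server-log-level=info \
  --pre-train-script=/opt/scripts/warmup.sh \
  train.py --checkpoint_dir=/fsx/checkpoints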

Troubleshooting

Installation Issues

Incompatible HyperPod AMI: Update to the latest version using the UpdateClusterSoftware API.

Incompatible Task Governance Version: Ensure HyperPod task governance is version v1.3.0-eksbuild.1 or higher.

Missing Permissions: Verify IAM permissions are correctly set up for the EKS Pod Identity Agent.

Job Execution Issues

Jobs Not Starting: Check that the HyperPod elastic agent is properly installed in your training image.
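
A quick way to confirm the agent is present in the image pushed earlier (image tag from the build step; pip is assumed to be on the image's PATH):

docker run --rm ${REGISTRY}fsdp:pytorch2.7.1 pip show hyperpod-elastic-agent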

Log Monitoring Not Working: Ensure training logs are emitted to sys.stdout and saved at /tmp/hyperpod/.

The key advantage of the HyperPod training operator is that it restarts failed training at the process level inside the existing pods rather than tearing down and rescheduling every pod. This surgical recovery keeps training running with minimal disruption.