HyperPod Training Operator Installation and Usage Guide

This guide covers installation of the HyperPod training operator and shows how to run distributed training jobs using examples from the awsome-distributed-training repository.

Prerequisites

Before you use the HyperPod training operator, you must complete the prerequisites, including installing the EKS Pod Identity Agent add-on on your EKS cluster:

aws eks create-addon \
--cluster-name my-eks-cluster \
--addon-name eks-pod-identity-agent \
--region <AWS_REGION>
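
To confirm the add-on finished installing, you can check its status (a quick sanity check; the cluster name matches the example above):

aws eks describe-addon \
--cluster-name my-eks-cluster \
--addon-name eks-pod-identity-agent \
--region <AWS_REGION> \
--query 'addon.status'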

Installation Methods

You can install the HyperPod training operator through three methods: the Amazon SageMaker AI console, the Amazon EKS console, or the AWS CLI.

Amazon SageMaker AI Console

The SageMaker AI console provides a one-click installation that automatically:

  • Creates the IAM execution role
  • Creates the pod identity association
  • Installs the operator

To install from the console:

  1. Open the Amazon SageMaker AI console
  2. Go to your cluster's details page
  3. On the Dashboard tab, locate "Amazon SageMaker HyperPod training operator"
  4. Choose Install

During installation, SageMaker AI creates an IAM execution role with permissions similar to the AmazonSageMakerHyperPodTrainingOperatorAccess managed policy.

Amazon EKS Console

The EKS console installation is similar, but it doesn't automatically create the IAM execution role. You can choose to create a new role during the process; the console pre-populates the required information.

AWS CLI

For programmatic installation with more customization options:

# Set up the EKS Pod Identity Agent (skip if already installed as a prerequisite)
aws eks create-addon \
--cluster-name my-eks-cluster \
--addon-name eks-pod-identity-agent \
--region <AWS_REGION>
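
The remaining CLI steps are to create an IAM execution role, associate it with the operator's service account through EKS Pod Identity, and install the operator itself. The sketch below assumes the operator is distributed as an EKS add-on named amazon-sagemaker-hyperpod-training-operator; the role ARN and service account name are placeholders, so substitute the values for your cluster:

# Associate the execution role with the operator's service account
aws eks create-pod-identity-association \
--cluster-name my-eks-cluster \
--namespace aws-hyperpod \
--service-account <OPERATOR_SERVICE_ACCOUNT> \
--role-arn arn:aws:iam::<ACCOUNT_ID>:role/<EXECUTION_ROLE> \
--region <AWS_REGION>

# Install the training operator add-on (add-on name assumed; check what is
# available with: aws eks describe-addon-versions)
aws eks create-addon \
--cluster-name my-eks-cluster \
--addon-name amazon-sagemaker-hyperpod-training-operator \
--region <AWS_REGION>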

Validate Installation

Once installed, verify the HyperPod controller manager pod is running:

kubectl get pods -n aws-hyperpod

Expected output:

NAME                                                              READY   STATUS    RESTARTS   AGE
health-monitoring-agent-bj57k                                     1/1     Running   0          17d
health-monitoring-agent-plcvm                                     1/1     Running   0          17d
hp-training-operator-hp-training-controller-manager-775bdf47f2s  1/1     Running   0          2d21h
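
You can also confirm that the operator's custom resource definitions are registered (the grep simply filters for HyperPod-related CRDs):

kubectl get crd | grep -i hyperpod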

Running Training Jobs

This example demonstrates how to run a HyperPodPyTorchJob using the same FSDP example from the awsome-distributed-training repository, configured for the HyperPod training operator.

1. Clone the awsome-distributed-training Repository

git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/pytorch/FSDP/kubernetes

2. Build and Push Docker Image

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

# Build the container (note: this includes hyperpod-elastic-agent)
pushd ../
docker build -f Dockerfile ${DOCKER_NETWORK} -t ${REGISTRY}fsdp:pytorch2.7.1 .
popd

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"fsdp\" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
aws ecr create-repository --repository-name fsdp
fi

# Login and push
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
docker image push ${REGISTRY}fsdp:pytorch2.7.1

The Dockerfile includes the HyperPod elastic agent installation:

...
RUN pip install hyperpod-elastic-agent
...

3. Job Submission Methods

HyperPodPyTorchJobs can be submitted via kubectl with a YAML manifest or via the HyperPod CLI v3.

3a. Job Submission via kubectl

The llama3_1_8b-fsdp-hpto.yaml file defines a HyperPod PytorchJob with robust error handling:

Key Features:

  • JobStart: fails the job if no "Loss:" line appears in the logs within the first 4 minutes (240 s)
  • JobHangingDetection: fails the job if the gap between consecutive "Loss:" lines exceeds 10 minutes (600 s)
  • Retry policy: 3 process-level restarts before a full job restart, with a maximum of 10 total retries (see the sketch below)
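
Sketched below is roughly how those thresholds map onto the manifest, using the parameter names documented in the configuration table later in this guide. The exact nesting under the job spec may differ from your copy of llama3_1_8b-fsdp-hpto.yaml, so treat this as an illustration rather than the file's literal contents:

# Nesting is illustrative -- follow the layout in the sample manifest
jobMaxRetryCount: 10                          # maximum of 10 total retries
restartPolicy:
  numRestartBeforeFullJobRestart: 3           # 3 process restarts before a full job restart
logMonitoringConfiguration:
  - name: JobStart
    logPattern: '.*Loss:.*'
    expectedStartCutOffInSeconds: 240         # fail if no "Loss:" appears within the first 4 minutes
  - name: JobHangingDetection
    logPattern: '.*Loss:.*'
    expectedRecurringFrequencyInSeconds: 600  # fail if the gap between "Loss:" lines exceeds 10 minutes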

For auto-resume from checkpoint, add FSx for Lustre volumes and modify checkpoint paths:

volumes:
  - name: fsx-storage
    persistentVolumeClaim:
      claimName: fsx-claim

volumeMounts:
  - name: fsx-storage
    mountPath: /fsx

# Update command args:
- '--checkpoint_dir=/fsx/checkpoints'
- '--resume_from_checkpoint=/fsx/checkpoints'

Submit the job:

envsubst < llama3_1_8b-fsdp-hpto.yaml | kubectl apply -f -
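
After submission, confirm that the job resource was created and its pods are starting (the job name comes from the manifest above):

kubectl get hyperpodpytorchjob llama3-1-8b-fsdp
kubectl get pods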

4. Monitor Training Jobs

Install kubetail for log monitoring:

curl -sL https://raw.githubusercontent.com/aws-samples/aws-do-eks/refs/heads/main/Container-Root/eks/ops/setup/install-kubetail.sh | sudo bash

View logs:

kubetail llama3

Describe the HyperPodPytorchJob:

kubectl describe hyperpodpytorchjob llama3-1-8b-fsdp

5. Testing Resiliency

Emulate an instance failure to test the operator's recovery capabilities:

export NODE=$(kubectl get nodes | awk 'NR>1 {print $1}' | shuf -n 1)
kubectl label node $NODE \
sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot \
--overwrite=true
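
After the test, the operator and health agents handle remediation. If you need to clear the label manually (assuming the health agent hasn't already updated it), remove it with kubectl:

kubectl label node $NODE sagemaker.amazonaws.com/node-health-status-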

Check the job status:

kubectl describe hyperpodpytorchjob

Expected output showing fault remediation:

Status:
  Conditions:
    Last Transition Time:  2025-08-04T22:06:18Z
    Status:                True
    Type:                  Created
    Last Transition Time:  2025-08-04T22:08:37Z
    Status:                True
    Type:                  PodsRunning
    Last Transition Time:  2025-08-04T22:08:44Z
    Message:               The fault of reason NodeFault was remediated in 94283 milliseconds.
    Reason:                Running
    Status:                True
    Type:                  Running
  Restart Count:           1
Events:
  Type     Reason     Age   From                             Message
  ----     ------     ----  ----                             -------
  Warning  NodeFault  117s  hyperpod-pytorch-job-controller  Found unhealthy node hyperpod-i-03d315d8cef22bd25
  Normal   Running    23s   hyperpod-pytorch-job-controller  The fault of reason NodeFault was remediated in 94283 milliseconds.

Log Monitoring Configuration Parameters

The following table describes the restart and log monitoring configuration parameters:

| Parameter | Description |
| --- | --- |
| jobMaxRetryCount | Maximum number of restarts at the process level |
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level |
| restartPolicy: evalPeriodSeconds | The period of evaluating the restart limit in seconds |
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails |
| cleanPodPolicy | Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None |
| logMonitoringConfiguration | The log monitoring rules for slow and hanging job detection |
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING |
| expectedStartCutOffInSeconds | Time to first LogPattern match after which the rule evaluates to HANGING |
| logPattern | Regular expression that identifies log lines that the rule applies to when the rule is active |
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before marking a job as SLOW |
| metricThreshold | Threshold for the value extracted by LogPattern with a capturing group |
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq |
| stopPattern | Regular expression to identify the log line at which to deactivate the rule |
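
As an illustration of the metric-based fields, a rule that marks a job SLOW when a throughput value extracted from the logs falls below a threshold might look like the sketch below. The rule name, log format, and threshold are hypothetical, and the direction of the operator comparison is assumed:

logMonitoringConfiguration:
  - name: SlowThroughput
    logPattern: '.*samples/sec: (\d+\.?\d*).*'  # capturing group extracts the throughput value
    metricThreshold: 80                         # hypothetical threshold in samples/sec
    operator: lt                                # assumed: rule evaluates to SLOW when the extracted value is less than the threshold
    metricEvaluationDataPoints: 5               # 5 consecutive SLOW evaluations before the job is marked SLOW
    stopPattern: '.*Training complete.*'        # deactivate the rule once this line appears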

Advanced Configuration Examples

Testing Custom Log Monitoring

To test custom log monitoring configurations, modify your job's logMonitoringConfiguration:

logMonitoringConfiguration:
  - name: JobStart
    logPattern: '.*Loss:.*'
    expectedStartCutOffInSeconds: 1 # Change from 240 to 1 for testing

This will trigger a LogStateHanging_JobStart error if training doesn't start within 1 second, allowing you to test the monitoring system.
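
To observe the resulting failure, describe the job and inspect its conditions and events (the job name matches the earlier example; exactly where the LogStateHanging_JobStart reason surfaces may vary):

kubectl describe hyperpodpytorchjob llama3-1-8b-fsdp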

HyperPod Elastic Agent Arguments

The HyperPod elastic agent supports all PyTorch ElasticAgent arguments plus additional ones:

| Argument | Description | Default |
| --- | --- | --- |
| --shutdown-signal | Signal to send to workers for shutdown | "SIGKILL" |
| --shutdown-timeout | Timeout between SIGTERM and SIGKILL | 30 |
| --server-host | Agent server address | "0.0.0.0" |
| --server-port | Agent server port | 8080 |
| --server-log-level | Agent server log level | "info" |
| --server-shutdown-timeout | Server shutdown timeout | 300 |
| --pre-train-script | Path to pre-training script | None |
| --pre-train-args | Arguments for pre-training script | None |
| --post-train-script | Path to post-training script | None |
| --post-train-args | Arguments for post-training script | None |
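
For reference, here is a hedged example of how these flags might be combined in a container entrypoint. The hyperpodrun launcher name, node and process counts, pre-train script, and training script are assumptions rather than values taken from the sample job:

# Illustrative launch command -- adjust names and paths to your image
hyperpodrun \
  --nnodes=2 --nproc-per-node=8 \
  --shutdown-signal=SIGTERM \
  --shutdown-timeout=60 \
  --server-port=8080 \
  --server-log-level=info \
  --pre-train-script=/opt/scripts/warmup.sh \
  train.py --checkpoint_dir=/fsx/checkpoints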

Troubleshooting

Installation Issues

Incompatible HyperPod AMI: Update to the latest version using the UpdateClusterSoftware API.

Incompatible Task Governance Version: Ensure HyperPod task governance is version v1.3.0-eksbuild.1 or higher.

Missing Permissions: Verify IAM permissions are correctly set up for the EKS Pod Identity Agent.

Job Execution Issues

Jobs Not Starting: Check that the HyperPod elastic agent is properly installed in your training image.
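
A quick way to confirm the agent is present in the image pushed earlier (image tag from the build step; pip is assumed to be on the image's PATH):

docker run --rm ${REGISTRY}fsdp:pytorch2.7.1 pip show hyperpod-elastic-agent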

Log Monitoring Not Working: Ensure training logs are emitted to sys.stdout and saved at /tmp/hyperpod/.

The key advantage of the HyperPod training operator is that it restarts failed training at the process level inside the existing pods rather than tearing down and rescheduling every pod. This surgical recovery keeps training running with minimal disruption.