Troubleshooting Guide
This guide helps you diagnose and resolve common issues with your HyperPod cluster.
Quick Reference Table
| Issue Category | Orchestrator | Subject (Symptom) | Reason | Resolution | Link to Details |
|---|---|---|---|---|---|
| Deployment | Common | CloudFormation deployment failed, need detailed error | Nested stack structure hides root cause errors | Navigate through nested stacks to find failed resource | Details |
| Deployment | Common | Cluster creation fails with capacity error | Insufficient capacity, wrong availability zone | Use Flexible Training Plans or reserved capacity, verify AZ matches reservation | Details |
| Deployment | Common | Cluster creation fails with lifecycle script error | Script syntax errors, missing dependencies, S3 access issues | Review CloudWatch logs, verify S3 access, check script syntax | Details |
| Deployment | Common | EFA health checks did not run successfully | Missing security group self-referencing rule | Add outbound rule allowing all traffic to the security group itself | Details |
| Deployment | EKS | Cluster is InService but not seeing instances | Continuous Provisioning mode behavior, instance creation failures | Check cluster events for instance creation status and errors | Details |
| Deployment | EKS | Cannot access EKS cluster with kubectl | IAM identity not configured in EKS access entries | Add IAM identity to access entries, associate access policy | Details |
| Deployment | Common | SSM session not starting or getting error | SSM plugin not installed, wrong target format, incorrect region | Install SSM plugin, use HyperPod target format, verify region | Details |
| Node Management | Slurm | Node not responding / Slurm says node is "down" | Network issues, slurmd daemon stopped, resource exhaustion | Check connectivity, verify slurmd status, check memory/disk | Details |
| Node Management | Slurm | Node shows "Node unexpectedly rebooted" | Node rebooted without Slurm being notified, slurmd not running | Resume node after verifying it's healthy, check slurmd status | Details |
| Node Management | Slurm | Jobs stuck in PENDING/COMPLETING, nodes in wrong state | Controller cache issues, stale state, communication problems | Restart slurmctld to re-sync state | Details |
| Node Management | Common | Node replacement not happening automatically | Auto-recovery disabled, capacity unavailable, quota limits | Check auto-recovery settings, verify capacity, review quotas | Details |
| Node Management | Common | Node replacement not happening even after manual trigger | Wrong command syntax, cluster state, IAM permissions, capacity issues | Verify command syntax, check cluster state, review IAM permissions | Details |
| Performance | Common | NCCL timeouts | Network congestion, EFA issues, insufficient timeout value | Increase NCCL_TIMEOUT, verify EFA, check network connectivity | Details |
| Performance | Common | Uneven NCCL performance across nodes | Network topology differences, degraded EFA, instance variations | Check EFA bandwidth, verify instance types, use placement groups | Details |
| Performance | Common | Poor filesystem performance | Insufficient throughput, wrong volume type, I/O bottleneck | Check filesystem metrics, increase throughput, optimize I/O operations | Details |
| Memory | Common | "Cannot allocate memory" at os.fork() | Insufficient shared memory, huge pages not configured for EFA | Set FI_EFA_USE_HUGE_PAGE=0, increase --shm-size, reduce num_workers | Details |
| GPU | Common | Suspecting GPU failure | Hardware failure, ECC errors, thermal throttling | Run nvidia-smi diagnostics, check ECC errors, drain node | Details |
| GPU | Common | EFA/NCCL/CUDA/driver version mismatch | Incompatible versions, host/container mismatch | Check version compatibility, rebuild containers with matching versions | Details |
| Storage | Common | Root volume exhausted, need to expand | Root volume limited to 100GB, cannot be expanded | Use secondary EBS (/opt/sagemaker), NVMe (/opt/dlami/nvme), FSx, or S3 | Details |
| Utilities | Slurm | Need to find instance ID from node name | Node names use IP format, AWS operations need instance ID | Query resource_config.json or use HyperPod APIs | Details |
Troubleshooting Details
Deployment Issues
Finding Detailed CloudFormation Error Messages
Orchestrator: Common (Slurm, EKS)
Issue: HyperPod cluster deployment via management console fails, but error message is not detailed enough to identify root cause
Background: When you deploy a HyperPod cluster using the HyperPod management console, it creates a CloudFormation stack behind the scenes. This stack uses nested stacks to organize resources. The most relevant error message for the root cause is often buried in the nested stacks at the individual AWS resource level, not at the top-level stack.
Resolution Steps:
1. Navigate to the CloudFormation console:
   - Go to https://console.aws.amazon.com/cloudformation
   - Ensure you're in the region where the cluster was being deployed
2. Find the HyperPod stack:
   - Look for a stack with a name related to your cluster
   - The stack status will show as "CREATE_FAILED" or "ROLLBACK_COMPLETE"
3. Check the Events tab:
   - Click on the failed stack and go to the "Events" tab
   - Look for events with status "CREATE_FAILED"
   - Note: the error at this level may be generic, such as "Embedded stack failed"
4. Navigate to nested stacks:
   - In the "Resources" tab, look for resources of type "AWS::CloudFormation::Stack"; these are nested stacks
   - Click the Physical ID (stack name) of any nested stack that shows "CREATE_FAILED" status to open it in a new view
5. Drill down through nested stacks:
   - Repeat step 4 for each level of nesting
   - Continue drilling down until you reach a stack with no further nested stacks, only AWS resources
   - Look for the specific resource that failed (not another nested stack)
6. Find the failed resource:
   - In the deepest nested stack, go to the "Events" tab
   - Look for the specific AWS resource that failed (e.g., AWS::SageMaker::Cluster, AWS::IAM::Role, AWS::Lambda::Function)
   - The "Status reason" column shows the detailed error message, typically the most useful one for troubleshooting
7. Common resource types and their errors:
   - AWS::SageMaker::Cluster: capacity errors, subnet issues, security group problems, lifecycle script failures
   - AWS::IAM::Role: permission errors, trust relationship issues
   - AWS::Lambda::Function: execution errors, timeout issues
   - AWS::EC2::VPC: CIDR conflicts, quota limits
   - Custom::Resource: Lambda-backed custom resource errors (check the Lambda logs)
Tips:
- Use the search/filter: In the Events tab, you can filter by "Failed" status to quickly find errors
- Check timestamps: Look at the most recent failed events
- Multiple failures: If multiple resources failed, start with the earliest failure - later failures may be cascading effects
- Custom resources: If a Custom::Resource fails, check the associated Lambda function's CloudWatch logs for detailed error messages
- Copy error messages: Copy the full error message for searching documentation or contacting support
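If you prefer the command line, the same drill-down can be scripted. The sketch below, assuming AWS CLI v2 and a placeholder stack name, lists CREATE_FAILED events on the top-level stack and on each directly nested stack; repeat it per level for deeper nesting.
# Sketch: surface CREATE_FAILED events from a stack and its directly nested stacks.
# STACK_NAME is a placeholder; adjust to your HyperPod stack.
STACK_NAME="my-hyperpod-stack"
aws cloudformation describe-stack-events \
  --stack-name "$STACK_NAME" \
  --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
  --output table
# Find nested stacks, then run the same event query against each of them
aws cloudformation list-stack-resources \
  --stack-name "$STACK_NAME" \
  --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].PhysicalResourceId" \
  --output text | tr '\t' '\n' | while read -r NESTED; do
    echo "=== $NESTED ==="
    aws cloudformation describe-stack-events \
      --stack-name "$NESTED" \
      --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
      --output table
done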
Cluster Creation Failing with Capacity Error
Orchestrator: Common (Slurm, EKS)
Issue: Cluster creation fails with insufficient capacity error
Common Error Messages:
- "Insufficient capacity"
- "We currently do not have sufficient capacity in the Availability Zone you requested"
- "Cannot provision requested instances"
Background: Depending on the instance type, region, and availability zone you choose, it can be challenging to allocate requested capacity on-demand, especially for large instance types (p4d, p5, etc.). Additionally, on-demand instances are not necessarily allocated in close proximity, which can impact network performance for distributed training workloads.
Capacity Reservation Options:
HyperPod supports three options for securing compute capacity:
1. On-Demand Instances
- Best for: Small instance types, short-term usage, experimental workloads
- Pros: No upfront commitment, immediate availability for common instance types
- Cons:
- Not guaranteed for large instance types
- Instances may not be in close proximity (suboptimal network topology)
- Not recommended for production workloads
- Higher cost compared to reserved options
2. Flexible Training Plans
- Best for: Medium to large workloads with predictable schedules
- How it works:
- Query available capacity by instance type, instance count, and desired schedule
- Self-service purchase at discounted prices
- Capacity duration up to 180 days
- Pros:
- Guaranteed capacity for the reserved period
- Discounted pricing compared to on-demand
- Better network topology (instances allocated together)
- Cons: Requires planning ahead and commitment
3. Reserved Capacity via AWS Account Team
- Best for: Large-scale, long-term capacity needs
- How it works: Contact your AWS account team to reserve capacity
- Pros:
- Best option for large or long-term capacity reservations
- Guaranteed capacity and optimal placement
- Customized solutions for specific requirements
- Cons: Requires engagement with account team and longer lead time
Resolution Steps:
1. If using On-Demand and facing capacity errors:
   - Consider switching to Flexible Training Plans for guaranteed capacity
   - Try different availability zones within your region
   - Consider smaller instance types or fewer instances
   - Contact your AWS account team for capacity reservation options
2. If using Flexible Training Plans or Reserved Capacity and still facing errors:
   - Verify TrainingPlanArn is specified: for Flexible Training Plans, ensure the TrainingPlanArn field in your cluster configuration contains the ARN of the purchased training plan (a hypothetical create-cluster excerpt follows this list)
   - Verify the availability zone: ensure your instance group configuration specifies the availability zone where capacity was reserved
   - Verify the subnet ID corresponds to the availability zone where capacity was reserved
   - Contact your AWS account team to confirm the reservation details
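For reference, here is a minimal, hypothetical create-cluster call showing where TrainingPlanArn sits in an instance group definition. All names, IDs, and ARNs are placeholders; verify the field names against the current CreateCluster API reference before use.
aws sagemaker create-cluster \
  --cluster-name my-cluster \
  --vpc-config '{"SecurityGroupIds":["sg-0123456789abcdef0"],"Subnets":["subnet-0123456789abcdef0"]}' \
  --instance-groups '[{
    "InstanceGroupName": "worker-group",
    "InstanceType": "ml.p5.48xlarge",
    "InstanceCount": 4,
    "LifeCycleConfig": {"SourceS3Uri": "s3://my-bucket/lifecycle/", "OnCreate": "on_create.sh"},
    "ExecutionRole": "arn:aws:iam::111122223333:role/MyHyperPodExecutionRole",
    "TrainingPlanArn": "arn:aws:sagemaker:us-west-2:111122223333:training-plan/my-plan"
  }]'
# The subnet above must be in the availability zone where the training plan's capacity was reserved.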
Cluster Creation Failed with Lifecycle Script Execution Error
Orchestrator: Common (Slurm, EKS)
Issue: HyperPod cluster creation fails during lifecycle script execution
Common Causes:
- Syntax errors in lifecycle scripts
- Missing dependencies or packages
- S3 access issues for script retrieval
- Insufficient permissions for script operations
- Network connectivity problems
Resolution Steps:
1. Check the CloudWatch logs for the cluster creation process (a CLI sketch for tailing these logs follows this list):
   - Log Group: /aws/sagemaker/Clusters/<cluster-name>/<cluster-id> (example: /aws/sagemaker/Clusters/k8-3/gyazigf6kqq9)
   - Log Stream: LifecycleConfig/<node-group-name>/<instance-id> (example: LifecycleConfig/group-g5-8x/i-0df4aefe56f4ef3bc)
   - Look for error messages, stack traces, or failed commands in the logs
2. If logs are not available or are empty, verify IAM permissions:
   - Check that the IAM execution role has CloudWatch Logs write permissions
   - Verify the IAM role has permission to access the S3 bucket where lifecycle scripts are stored (s3:GetObject, s3:ListBucket)
   - Confirm the S3 path is correct in the cluster configuration
   - Check bucket permissions and IAM role policies; ensure the S3 bucket policy allows the IAM role to read objects
3. Check for updated versions of the default lifecycle scripts:
   - The lifecycle script version you're using may have known issues that have since been fixed
   - Compare your scripts with the latest versions:
     - HyperPod EKS: https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config
     - HyperPod Slurm: https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
   - Review the commit history for bug fixes and improvements, and update to the latest version if available
4. Review the script syntax and test locally if possible:
   - Verify the script uses Linux line endings (LF, not CRLF); scripts created on Windows may have CRLF line endings, which cause execution failures on Linux
   - Convert to LF using dos2unix script.sh or your text editor's line-ending conversion
   - Check line endings with file script.sh (it should report "ASCII text", not "ASCII text, with CRLF line terminators")
   - Ensure the script has a proper shebang (e.g., #!/bin/bash)
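The following sketch pulls these checks together on the command line. It assumes AWS CLI v2 (for aws logs tail) and placeholder cluster name/ID and script names.
CLUSTER_NAME="my-cluster"
CLUSTER_ID="abcd1234efgh"
# Tail recent lifecycle-script log streams for the cluster
aws logs tail "/aws/sagemaker/Clusters/${CLUSTER_NAME}/${CLUSTER_ID}" \
  --log-stream-name-prefix "LifecycleConfig/" --since 1h
# Sanity-check a script before uploading it to S3
file on_create.sh        # should not report "with CRLF line terminators"
head -n 1 on_create.sh   # should print a shebang such as #!/bin/bash
dos2unix on_create.sh    # convert CRLF to LF if needed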
EFA Health Checks Did Not Run Successfully
Orchestrator: Common (Slurm, EKS)
Issue: Cluster creation fails with error "EFA health checks did not run successfully. Ensure that your VPC and security groups are properly configured before attempting to create a new cluster."
Common Cause:
- Security group is missing a self-referencing outbound rule that allows nodes to communicate with each other via EFA
Resolution Steps:
1. Identify the security group used for the HyperPod cluster.
2. Add the required outbound rules to the security group:
   - Rule 1, intra-SG communication (required for EFA):
     - Type: All traffic
     - Protocol: All (-1)
     - Destination: the security group itself (self-referencing)
     - Description: allow traffic within the security group
   - Rule 2, internet access:
     - Type: All traffic
     - Protocol: All (-1)
     - Destination: 0.0.0.0/0
     - Description: allow traffic to the internet (for AWS API calls, package downloads, etc.)
3. Verify the security group has the following inbound rule:
   - Intra-SG communication:
     - Type: All traffic
     - Protocol: All (-1)
     - Source: the security group itself (self-referencing)
4. Ensure all nodes in the cluster use the same security group.
5. After fixing the security group, retry cluster creation.
Reference Configuration:
See the CloudFormation template at eks/cloudformation/security-group-template.yaml for the complete security group setup used by HyperPod.
Prevention:
- Always include self-referencing rules (both inbound and outbound) when creating security groups for HyperPod clusters
- Use the provided CloudFormation templates which include proper security group configuration
- Test security group configuration before cluster creation
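If you manage the security group yourself, the rules can also be added from the CLI. This is a minimal sketch with a placeholder security group ID; a call fails harmlessly with a duplicate-rule error if an equivalent rule already exists.
SG_ID="sg-0123456789abcdef0"
# Inbound: allow all traffic from the security group itself (self-referencing)
aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"
# Outbound: allow all traffic to the security group itself (self-referencing)
aws ec2 authorize-security-group-egress \
  --group-id "$SG_ID" \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"
# Outbound: allow all traffic to the internet (AWS APIs, package downloads)
aws ec2 authorize-security-group-egress \
  --group-id "$SG_ID" \
  --ip-permissions "IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]"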
Cluster is InService Status but Not Seeing Instances
Orchestrator: EKS
Issue: Cluster shows "InService" status but instances are not visible or not being created
Common Cause: This is expected behavior when using Continuous Provisioning mode (available for HyperPod EKS only). In this mode:
- The cluster transitions to "InService" status before all instances are created
- Instance creation happens asynchronously after the cluster becomes InService
- Instance creation failures are not reported as cluster or instance group creation failures
Note: Continuous Provisioning mode and cluster events are available for HyperPod EKS only. These features are not yet available for HyperPod Slurm as of January 2026.
Resolution Steps:
1. Check cluster events for instance creation status:
   - Via Management Console: navigate to https://console.aws.amazon.com/sagemaker/home#/cluster-management, select your cluster, and open the Events tab
   - Via AWS CLI: aws sagemaker list-cluster-events --cluster-name <cluster-name>
   - Look for events related to instance creation, provisioning status, and any error messages
2. Verify the cluster provisioning mode:
   aws sagemaker describe-cluster --cluster-name <cluster-name>
   Look for the provisioning configuration to confirm whether Continuous Provisioning is enabled.
3. Check HyperPod cluster node status (a polling sketch follows this list):
   - Via AWS CLI: aws sagemaker list-cluster-nodes --cluster-name <cluster-name>
   - Via Management Console: navigate to https://console.aws.amazon.com/sagemaker/home#/cluster-management, select your cluster, and view the node details
   - Look for node health status, instance state, and creation timestamps
4. Review CloudWatch logs for instance creation attempts:
   - Log Group: /aws/sagemaker/Clusters/<cluster-name>/<cluster-id>
   - Check for recent log streams from lifecycle scripts: LifecycleConfig/<node-group-name>/<instance-id>
   - Look for errors during instance provisioning or lifecycle script execution
5. If instances are failing to create, check for common issues:
   - Insufficient capacity in the selected availability zones
   - Lifecycle script errors (see Cluster Creation Failed with Lifecycle Script Execution Error)
   - IAM permission issues
   - Service quotas or limits
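A simple way to watch provisioning progress is to poll the node list. This sketch uses a placeholder cluster name; the query fields reflect the ListClusterNodes response shape at the time of writing, so verify them against the current API reference.
CLUSTER_NAME="my-cluster"
# Refresh the node list every 30 seconds
watch -n 30 "aws sagemaker list-cluster-nodes \
  --cluster-name $CLUSTER_NAME \
  --query 'ClusterNodeSummaries[].[InstanceId,InstanceGroupName,InstanceStatus.Status]' \
  --output table"
# Recent cluster events (HyperPod EKS only)
aws sagemaker list-cluster-events --cluster-name "$CLUSTER_NAME"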
Understanding Continuous Provisioning Mode:
- Allows the cluster to become operational even if some instances fail to provision
- Provides faster cluster availability for partial deployments
- Requires monitoring cluster events and node status to track instance creation progress
- Failed instances can be replaced individually without affecting the overall cluster status
Cannot Access EKS Cluster with kubectl
Orchestrator: EKS
Issue: Unable to access HyperPod EKS cluster using kubectl, receiving authentication or authorization errors
Common Error Messages:
- "couldn't get current server API group list: the server has asked for the client to provide credentials"
Common Cause: When using EKS's "IAM access entries" for access control, the IAM identity (user or role) you are using must be correctly configured in the access entries. If your IAM identity is not added or misconfigured, kubectl commands will fail with authentication or authorization errors.
Resolution Steps:
1. Verify your current IAM identity:
   aws sts get-caller-identity
   Note the ARN of the identity you're using (user or role).
2. Configure access entries via the EKS console (a CLI sketch follows these steps):
   - Navigate to https://console.aws.amazon.com/eks/clusters
   - Select your HyperPod EKS cluster and go to the "Access" tab
   - Under "IAM access entries", check whether your IAM identity is listed
   - If it is not present, click "Create access entry":
     - Enter your IAM principal ARN
     - Select an access policy (e.g., AmazonEKSClusterAdminPolicy for full access)
     - Choose the access scope (cluster-wide is recommended)
     - Click "Create"
   - If it is already present, verify the configuration:
     - Check that the access policies are correctly associated (e.g., AmazonEKSClusterAdminPolicy for full access)
     - Verify the namespace configuration if using namespace-scoped access
3. Update kubeconfig (if not already configured):
   aws eks update-kubeconfig --name <cluster-name> --region <region>
4. Test access:
   kubectl get nodes
   kubectl get pods -A
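The same access entry can be created from the CLI. This is a sketch with placeholder ARNs; the AmazonEKSClusterAdminPolicy ARN shown is the standard EKS access-policy ARN, but a more restrictive policy may be appropriate for your environment.
CLUSTER_NAME="my-eks-cluster"
PRINCIPAL_ARN="arn:aws:iam::111122223333:role/MyAdminRole"
# Create the access entry for your IAM principal
aws eks create-access-entry \
  --cluster-name "$CLUSTER_NAME" \
  --principal-arn "$PRINCIPAL_ARN"
# Attach a cluster-wide access policy to that principal
aws eks associate-access-policy \
  --cluster-name "$CLUSTER_NAME" \
  --principal-arn "$PRINCIPAL_ARN" \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster
# Refresh kubeconfig and test
aws eks update-kubeconfig --name "$CLUSTER_NAME" --region us-west-2
kubectl get nodes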
Note:
- Access entries are the recommended method for managing EKS cluster access
- Ensure the IAM identity has the necessary EKS permissions in IAM policies
- Changes to access entries may take a few moments to propagate
SSM Session Not Starting or Getting Error
Orchestrator: Common (Slurm, EKS)
Issue: Unable to start SSM session to HyperPod cluster nodes or receiving errors
Common Causes:
- SSM plugin not installed on development machine
- Incorrect SSM target name format
- Wrong AWS region configuration
Resolution Steps:
1. Install the AWS Systems Manager Session Manager plugin on your development machine:
   - Follow the official installation guide: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
   - Verify the installation: session-manager-plugin --version
2. Use the correct HyperPod-specific SSM target name format:
   - Format: sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id> (use the cluster ID, not the cluster name; you can find it in the console or via aws sagemaker describe-cluster)
   - Example: sagemaker-cluster:abcdfe1234_worker-group-i-0abc123def456789
   - Command: aws ssm start-session --target sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>
   - Note: do NOT use the EC2 instance ID alone (e.g., i-0abc123def456789); you must use the HyperPod target format
3. Verify the AWS region is correctly configured:
   - Check your AWS CLI profile's default region: aws configure get region
   - Or set the region explicitly using environment variables:
     export AWS_REGION=us-west-2
     export AWS_DEFAULT_REGION=us-west-2
   - Or specify the region in the command: aws ssm start-session --target <target> --region us-west-2
   - Ensure the region matches where your HyperPod cluster is deployed
4. Verify IAM permissions for SSM access:
   - Your IAM user/role needs at least: ssm:StartSession, sagemaker:DescribeCluster, sagemaker:ListClusterNodes
   - The cluster nodes must have the SSM agent running and the proper IAM role attached
5. Check that the instance is running and accessible:
   aws sagemaker list-cluster-nodes --cluster-name <cluster-name>
   Verify the instance status is "Running" or "InService".
6. Test connectivity with verbose output:
   aws ssm start-session --target <target> --debug
   Review the debug output for specific error messages.
Common Error Messages:
- "Target is not connected": Instance may be stopped, SSM agent not running, network connectivity issues, or incorrect target name format
- "Access denied": Verify IAM permissions for both your user and the instance role
SSH over SSM:
Important: Before using SSH, you must add your SSH public key to the ~/.ssh/authorized_keys file on the target node.
You can configure SSH to use SSM by adding entries to your SSH config file (~/.ssh/config):
Host my-cluster-controller
HostName sagemaker-cluster:abcdfe1234_controller-i-0abc123def456789
User ubuntu
IdentityFile ~/keys/my-key.pem
ProxyCommand aws --profile default --region us-west-2 ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p
Then connect simply with:
ssh my-cluster-controller
Helpful Tool:
For easier SSM session management with HyperPod clusters, consider using the hyperpod_ssm tool:
- Repository: https://github.com/shimomut/sagemaker-solutions/tree/main/hyperpod_ssm
- Simplifies SSM target name construction and session management
- Provides convenient commands for listing nodes and starting sessions
- Handles the HyperPod-specific target format automatically
Node Management Issues
Node Not Responding / Slurm Says Node is "Down"
Orchestrator: Slurm
Issue: Slurm node becomes unresponsive or shows as "down"
Resolution Steps:
1. Check node status:
   sinfo -N -l
   or
   scontrol show node <node-name>
2. If the node shows "down" status, check the reason message:
   sinfo -o "%N %T %30E"
   This displays the node name, state, and reason for the current state.
3. Check HyperPod cluster node status:
   - Via AWS CLI: aws sagemaker list-cluster-nodes --cluster-name <cluster-name>
   - Via Management Console: navigate to https://console.aws.amazon.com/sagemaker/home#/cluster-management, select your cluster, and view the node details
   - Look for node health status, instance state, and any error messages
4. Test connectivity to the node using multiple methods to identify what is working:
   - Ping: ping <node-ip-or-hostname>
   - Cross-node SSH: from another node, try ssh <node-ip-or-hostname>
   - SSM session: see SSM Session Not Starting or Getting Error for the correct HyperPod target format
   - Slurm srun: srun -w <node-name> hostname
   Testing these variations tells you which communication paths are functional.
5. If you can access the node, check the system logs:
   sudo journalctl -xe
6. Verify the slurmd daemon is running:
   sudo systemctl status slurmd
7. Check for out-of-memory or disk space issues:
   free -h
   df -h
8. If disk space is full, identify what is consuming it:
   # Check disk usage by filesystem
   df -h
   # Find large directories
   sudo du -h --max-depth=1 / | sort -hr | head -20
   # Check common locations for large files
   sudo du -sh /var/log/* | sort -hr
   sudo du -sh /tmp/* | sort -hr
   sudo du -sh /home/*/* | sort -hr
9. Clean up disk space if needed:
   - Delete old log files: sudo rm -f /var/log/*.log.* /var/log/*/*.gz
   - Clear temporary files: sudo rm -rf /tmp/*
   - Clean the package manager cache: sudo apt-get clean (Slurm, Ubuntu) or sudo yum clean all (EKS, Amazon Linux)
   - Remove old container images if using Docker: docker system prune -a
10. Restart slurmd if needed:
    sudo systemctl restart slurmd
11. If the node remains down, set it back to idle:
    scontrol update nodename=<node-name> state=resume
12. If none of the above steps resolve the issue, reboot the instance:
    aws sagemaker batch-reboot-cluster-nodes \
      --cluster-name <cluster-name> \
      --node-ids <instance-id>
13. If rebooting doesn't help, replace the node:
    aws sagemaker batch-replace-cluster-nodes \
      --cluster-name <cluster-name> \
      --node-ids <instance-id>
Node Unexpectedly Rebooted
Orchestrator: Slurm
Issue: Slurm node shows as "down" with reason "Node unexpectedly rebooted"
Common Symptoms:
- Node appears as "down" in sinfo output
- Reason message shows "Node unexpectedly rebooted"
- Node is actually running and accessible, but Slurm won't schedule jobs on it
Common Causes:
- Node was rebooted (manually or automatically) without notifying Slurm
- slurmd daemon stopped or crashed during reboot
- slurmd failed to start after reboot
- Network interruption during reboot prevented slurmd from re-registering with slurmctld
Diagnostic Steps:
1. Check node status and reason:
   sinfo -N -l
   scontrol show node <node-name>
   Look for "Reason=Node unexpectedly rebooted".
2. Verify the node is actually running:
   # Try to ping the node
   ping <node-ip>
   # Try to SSH to the node
   ssh <node-name>
3. Check if slurmd is running on the node:
   # On the affected node
   sudo systemctl status slurmd
4. Check the slurmd logs for errors:
   # On the affected node
   sudo journalctl -u slurmd -n 100
Resolution Steps:
1. If slurmd is not running, start it:
   # On the affected node
   sudo systemctl start slurmd
   sudo systemctl status slurmd
2. Resume the node in Slurm:
   # On the head node
   scontrol update nodename=<node-name> state=resume
3. Verify the node is back to the idle state:
   sinfo -N -l | grep <node-name>
   The node should now show as "idle" or "alloc" instead of "down".
4. If the node still shows as down, check for other issues:
   # Check if the node can communicate with the controller
   scontrol ping
   # Check node configuration
   scontrol show node <node-name>
Prevention:
To avoid this issue in the future:
- Ensure slurmd is configured to start automatically on boot:
  sudo systemctl enable slurmd
- When rebooting nodes intentionally, drain them first:
  scontrol update nodename=<node-name> state=drain reason="Planned reboot"
  # Reboot the node
  # After the reboot, resume the node
  scontrol update nodename=<node-name> state=resume
- Use HyperPod's batch-reboot-cluster-nodes command for managed reboots:
  aws sagemaker batch-reboot-cluster-nodes \
    --cluster-name <cluster-name> \
    --node-ids <instance-id>
Note:
- This is a protective mechanism in Slurm to prevent scheduling jobs on nodes that may have lost state during an unexpected reboot
- Always verify the node is healthy before resuming it
- If the node continues to have issues, consider replacing it instead of resuming
Jobs Stuck in PENDING/COMPLETING, Nodes in Wrong State
Orchestrator: Slurm
Issue: Jobs stuck in PENDING or COMPLETING state, nodes showing incorrect states, or Slurm controller not responding properly
Background: The slurmctld (Slurm Central Control Daemon) manages job scheduling, resource allocation, and communication with compute nodes. By design, slurmctld saves state to disk and restores it upon restart, allowing maintenance without losing pending or running jobs. Restarting slurmctld is a common fix for various controller-related issues.
When to Restart slurmctld:
1. Job Scheduling and Resource Allocation Issues
   - Jobs stuck in PENDING with REASON=RESOURCES: jobs remain queued despite available nodes; a restart forces queue re-evaluation
   - GRES (GPU/EFA) miscalculation: resources not released back to the pool after job completion, causing future jobs to hang
   - Jobs stuck in COMPLETING state: jobs remain in COMPLETING indefinitely, especially after instance replacements; the controller "memorizes" the COMPLETING state and keeps waiting even after the node has been replaced
2. Node State Problems
   - Nodes stuck in "Unknown" or "Down" state: nodes returned from reboot but the controller still thinks they are unavailable
   - Compute node communication failures: slurmctld stops responding to scontrol ping, or nodes cannot communicate with the head node
   - Node configuration changes: after adding new nodes or changing processor counts
3. Configuration Changes
   - Applying slurm.conf changes: after updating topology.conf or slurm.conf, especially TCP listening settings or node additions/removals
   - After reconfiguration commands: following scontrol reconfigure, particularly for topology updates after node relaunches
4. Controller Unresponsiveness
   - slurmctld hangs or deadlocks: the daemon becomes overwhelmed or unresponsive
   - Plugin/database issues: lost connection to slurmdbd or invalid RPC errors
   - Race conditions: version-specific bugs causing daemon malfunction
How to Restart:
1. Standard restart:
   sudo systemctl restart slurmctld
2. Verify service status:
   sudo systemctl status slurmctld
3. Check logs for issues:
   sudo journalctl -u slurmctld -n 100
4. If the controller is completely hung, kill and restart it:
   sudo systemctl stop slurmctld
   sudo pkill -9 slurmctld   # If stop doesn't work
   sudo systemctl start slurmctld
Important Notes:
- State preservation: by default, slurmctld restarts with state preservation, so running jobs continue
- Clean start (use with caution): if the state file is corrupted, use slurmctld -c to purge all running jobs and node states
- Verify after restart: check that nodes are in the expected states and jobs are running properly:
  sinfo
  squeue
  scontrol show config | grep StateSaveLocation
What Gets Preserved:
- Running jobs continue execution
- Pending jobs remain in queue
- Node states are restored from saved state
- Job history and accounting data
What Gets Reset:
- Controller memory cache
- Stale communication channels
- Hung internal processes
- Resource allocation calculations
Node Replacement Not Happening Automatically
Orchestrator: Common (Slurm, EKS)
Issue: Failed nodes are not being automatically replaced by HyperPod
Resolution Steps:
1. Check the HyperPod cluster auto-recovery settings in the SageMaker console or via the CLI:
   aws sagemaker describe-cluster --cluster-name <cluster-name>
   Look for the auto-recovery configuration.
2. Verify the cluster is not in a failed state that prevents recovery.
3. Check cluster events for auto-recovery information:
   - Via Management Console: navigate to https://console.aws.amazon.com/sagemaker/home#/cluster-management, select your cluster, and open the Events tab
   - Via AWS CLI: aws sagemaker list-cluster-events --cluster-name <cluster-name>
   - Look for events related to node health, replacement attempts, and any failures
   - Note: cluster events are available for HyperPod EKS. For HyperPod Slurm, this feature is not yet available as of January 2026
4. Check whether HyperPod's health monitoring agent detected an issue and triggered resiliency actions:
   - Check the CloudWatch Logs for the health monitoring agent:
     - Log Group: /aws/sagemaker/Clusters/<cluster-name>/<cluster-id>
     - Log Stream: SagemakerHealthMonitoringAgent/<node-group-name>/<instance-id> (example: SagemakerHealthMonitoringAgent/group-g5-8x/i-0aa017cbf6c240f3f)
     - Look for detected issues and triggered actions
   - For HyperPod Slurm: check if the node reason message indicates a resiliency action:
     sinfo -o "%N %T %30E"
     The reason message must be exactly "Action:Reboot" or "Action:Replace" for auto-recovery to trigger.
   - For HyperPod EKS: check node labels for resiliency actions:
     kubectl get nodes --show-labels
     kubectl describe node <node-name>
     Look for the following labels indicating resiliency actions have been triggered:
     - sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement (node is marked for replacement)
     - sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot (node is marked for reboot)
     See: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-node-labels.html
5. Review CloudWatch logs for auto-recovery attempts:
   - Log Group: /aws/sagemaker/Clusters/<cluster-name>/<cluster-id>
   - Check for recent log streams from lifecycle scripts: LifecycleConfig/<node-group-name>/<instance-id>
   - If the lifecycle script fails during auto-recovery, the new instance cannot be created and auto-recovery will fail
   - Look for error messages in the lifecycle script logs that might prevent successful node replacement
6. Confirm capacity is available for replacement instances in the selected availability zones.
7. If you need to recover from the failed instance immediately, trigger a manual reboot or replacement:
   - Manual reboot:
     aws sagemaker batch-reboot-cluster-nodes \
       --cluster-name <cluster-name> \
       --node-ids <instance-id>
   - Manual replacement:
     aws sagemaker batch-replace-cluster-nodes \
       --cluster-name <cluster-name> \
       --node-ids <instance-id>
Node Replacement Not Happening Even After Manual Trigger
Orchestrator: Common (Slurm, EKS)
Issue: Manual node replacement command fails or doesn't complete
Resolution Steps:
1. Use the recommended batch commands instead of legacy methods:
   - Recommended: use the batch-replace-cluster-nodes or batch-reboot-cluster-nodes commands
   - Legacy methods (not recommended): setting node status in Slurm or node labels in Kubernetes
   - The batch commands return clear success/failure messages indicating whether the service accepted the request
2. Check HyperPod cluster node status:
   - Via AWS CLI: aws sagemaker list-cluster-nodes --cluster-name <cluster-name>
   - Via Management Console: navigate to https://console.aws.amazon.com/sagemaker/home#/cluster-management, select your cluster, and view the node details
   - Look for node health status, instance state, and any error messages
3. Check cluster events for replacement information:
   - Via Management Console: navigate to https://console.aws.amazon.com/sagemaker/home#/cluster-management, select your cluster, and open the Events tab
   - Via AWS CLI: aws sagemaker list-cluster-events --cluster-name <cluster-name>
   - Look for events related to the replacement request, node status changes, and any error messages
   - Note: cluster events are available for HyperPod EKS. For HyperPod Slurm, this feature is not yet available as of January 2026
4. Verify the replacement command syntax and check the command output for error messages:
   aws sagemaker batch-replace-cluster-nodes \
     --cluster-name <cluster-name> \
     --node-ids <instance-id>
5. Verify the instance ID is correct and belongs to the cluster:
   aws sagemaker list-cluster-nodes --cluster-name <cluster-name>
6. Ensure the cluster is in a state that allows node replacement (not "Creating" or "Deleting").
7. Review CloudWatch logs for replacement attempts:
   - Log Group: /aws/sagemaker/Clusters/<cluster-name>/<cluster-id>
   - Check for recent log streams from lifecycle scripts: LifecycleConfig/<node-group-name>/<instance-id>
   - If the lifecycle script fails during replacement, the new instance cannot be created and replacement will fail
   - Look for error messages in the lifecycle script logs that might prevent successful node replacement
8. Verify capacity is available for the instance type in the target availability zone.
GPU and Accelerator Issues
Suspecting GPU Failure
Orchestrator: Common (Slurm, EKS)
Issue: Training jobs fail or produce incorrect results, GPU errors in logs
Common Symptoms:
- CUDA errors in application logs
- Training produces NaN or incorrect results
- GPU memory errors or allocation failures
- System crashes during GPU-intensive operations
- High temperatures or thermal throttling
Diagnostic Steps:
- Check GPU status: run nvidia-smi -q and look for errors
- Check for ECC errors: nvidia-smi -q | grep -A 5 "ECC Errors"
- Monitor temperature and power: nvidia-smi dmon -s pucvmet
- Run DCGM diagnostic tests for comprehensive validation (a sketch follows the resolution steps below)
- Run GPU burn tests to stress-test under sustained load
- Monitor for thermal throttling and memory errors during stress tests
Resolution Steps:
- Document baseline thermal and performance characteristics
- If GPU shows errors or high temperatures, drain the node from scheduler
- Analyze temperature, power draw, and performance consistency
- Document GPU serial number, error details, and test results
- Contact AWS Support for hardware replacement
- Replace the node once new hardware is available
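As a starting point, a quick health sweep might look like the sketch below. It assumes the NVIDIA driver tools are on the node and that DCGM (dcgmi) is installed; adjust or drop the DCGM step if it is not.
# Aggregate ECC error counters per GPU
nvidia-smi -q | grep -i -A 3 "ecc errors"
# Temperature, power draw, and active throttle reasons
nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,clocks_throttle_reasons.active --format=csv
# Medium-length DCGM diagnostic (level 2); requires DCGM
dcgmi diag -r 2
# Bundle detailed logs to attach to an AWS Support case
sudo nvidia-bug-report.sh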
EFA/NCCL/CUDA/Nvidia Driver Version Mismatch
Orchestrator: Common (Slurm, EKS)
Issue: Training fails with EFA or NCCL errors, performance degradation
Common Symptoms:
- NCCL initialization failures
- EFA device not found errors
- CUDA device not initialized
- Unexpected performance drops
- Segmentation faults during distributed training
- Training works on host but fails in container, or vice versa
Common Causes:
- Incompatible versions between CUDA, NCCL, EFA, and drivers
- CUDA driver and nvcc compiler version mismatch
- Mismatch between host and container environments
- Missing or incorrectly mounted EFA libraries in containers
- Different PyTorch/TensorFlow versions between host and container
Diagnostic Steps:
- Run the PyTorch environment validation to check CUDA, NCCL, and MPI availability
- Run the EFA validation script to check the EFA installer, libfabric, and AWS OFI NCCL versions
- Check the CUDA driver vs. compiler version: nvidia-smi vs. nvcc --version
- Verify NVLink status and topology: nvidia-smi nvlink --status
- Compare versions between the host and container environments (a version-collection sketch follows the resolution steps)
- Check whether EFA interfaces are found and properly configured
Resolution Steps:
- Ensure CUDA driver and nvcc compiler versions match
- Check version compatibility documentation:
- EFA installer (including libnccl-ofi) and NCCL compatibility: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-changelog.html
- NVIDIA driver and CUDA toolkit compatibility: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/
- Verify version compatibility using the EFA compatibility matrix
- For containers: mount EFA libraries and devices properly
- Verify LD_LIBRARY_PATH includes EFA and CUDA libraries
- Initialize CUDA devices if needed (may require reboot)
- Match PyTorch/TensorFlow versions between host and container
- Rebuild containers with compatible versions from the compatibility matrix
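A small sketch to collect the relevant versions on the host and, when run inside the container, for comparison. The EFA installer path shown is the default install location, and the last line assumes PyTorch is installed.
# NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA toolkit compiler version
nvcc --version | grep release
# EFA installer contents (default install location)
cat /opt/amazon/efa_installed_packages 2>/dev/null
# libfabric version
fi_info --version
# Framework view: PyTorch, its CUDA build, and bundled NCCL
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"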
Performance Issues
NCCL Timeouts
Orchestrator: Common (Slurm, EKS)
Issue: Distributed training fails with NCCL timeout errors
Common Error Messages:
- "NCCL timeout in call to..."
- "NCCL communicator was aborted"
- "Net/IB : Got completion with error"
Diagnostic Steps:
- Enable NCCL debug logging: export NCCL_DEBUG=INFO
- Verify EFA adapters are working: fi_info -p efa
- Run pairwise NCCL tests between nodes to identify problematic connections (see the sketch after the resolution steps)
- Check for security group restrictions blocking inter-node traffic
- Monitor for test failures or hangs that indicate network issues
Resolution Steps:
- Increase the NCCL timeout if needed: export NCCL_TIMEOUT=3600
- Verify EFA is being used: export FI_EFA_USE_DEVICE_RDMA=1
- Check and fix security group rules to allow all traffic between nodes
- Isolate and drain problematic nodes showing low bandwidth
- Reduce batch size or adjust parallelism if memory pressure exists
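A pairwise bandwidth check with nccl-tests often isolates the problem quickly. The sketch below assumes a Slurm cluster, 8 GPUs per node, and nccl-tests built under /opt/nccl-tests; adjust the path and counts for your environment.
export NCCL_DEBUG=INFO
export FI_EFA_USE_DEVICE_RDMA=1
# Run all_reduce_perf across a pair of nodes; repeat for different node pairs
srun --nodes=2 --ntasks-per-node=8 --gpus-per-node=8 \
  /opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
Sweeping over node pairs and comparing the reported bus bandwidth makes a slow link or node stand out.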
Uneven NCCL Performance Depending on the Set of Nodes
Orchestrator: Common (Slurm, EKS)
Issue: Training performance varies significantly based on which nodes are allocated
Common Causes:
- Network topology differences between nodes
- Degraded EFA performance on some nodes
- Mixed instance types or generations
- CPU frequency scaling differences
Diagnostic Steps:
- Check network topology: nvidia-smi topo -m
- Verify the EFA configuration on all nodes: fi_info -p efa
- Run pairwise NCCL bandwidth tests to identify slow node pairs
- Check for mixed instance types or generations
- Monitor for inconsistent results across multiple test runs
Resolution Steps:
- Run comprehensive NCCL all-reduce tests across all nodes
- Use topology-aware testing scripts to systematically identify bad nodes
- Check failed jobs and isolate problematic nodes
- Configure EFA optimization settings and GPU affinity
- Drain underperforming nodes and use placement groups for consistency
Poor Filesystem Performance
Orchestrator: Common (Slurm, EKS)
Issue: Slow I/O operations, training bottlenecked by data loading, checkpoint saving, or loading executables and scripts
Resolution Steps:
1. Check performance metrics in CloudWatch:
   - Navigate to the CloudWatch console and select your filesystem
   - Monitor key metrics: IOPS, throughput, data read/write bytes
   - Look for metrics hitting their limits or showing sustained high usage
2. Check the provisioned performance configuration:
   - FSx for Lustre: review the throughput-per-TiB setting
   - FSx for OpenZFS: check provisioned IOPS and throughput
   - EBS volumes: verify the volume type (gp3, io2) and provisioned IOPS/throughput
   - Compare the current configuration against your workload requirements
3. Investigate bottlenecks (a simple fio sketch follows this list):
   - If metrics show bottlenecks, identify which operations are causing the heavy I/O:
     - Check which processes or jobs are performing heavy I/O
     - Review application logs for I/O patterns
     - Use filesystem-specific monitoring tools
   - Determine whether the bottleneck is legitimate workload demand or inefficient I/O patterns
4. Consider upgrading provisioned performance if the workload legitimately needs it:
   - FSx for Lustre: increase storage capacity (throughput scales with size)
   - FSx for OpenZFS: increase provisioned IOPS/throughput
   - EBS: upgrade the volume type or increase provisioned IOPS/throughput
5. Understand filesystem performance characteristics:
   - AWS offers multiple filesystem options with different characteristics:
     - FSx for Lustre: high-performance parallel filesystem, best for large sequential I/O
     - FSx for OpenZFS: good for mixed workloads, supports snapshots and cloning
     - EBS: block storage, good for single-instance workloads
     - Instance store (NVMe): highest performance but non-persistent
   - Choose the filesystem that matches your I/O patterns
6. Consider switching filesystem type:
   - For HyperPod Slurm, the default lifecycle script offers an option to use FSx for OpenZFS instead of Lustre for home directories
   - Evaluate whether a different filesystem type better suits your workload:
     - Small random I/O → consider OpenZFS
     - Large sequential I/O → Lustre is optimal
     - Temporary high-performance data → use NVMe instance storage
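To separate filesystem limits from application inefficiency, a quick synthetic benchmark helps. This fio sketch uses a placeholder mount point (/fsx) and assumes fio is installed (for example via sudo apt-get install -y fio).
TARGET_DIR=/fsx/fio-test
mkdir -p "$TARGET_DIR"
# Large sequential reads (typical of dataset/checkpoint streaming)
fio --name=seqread --directory="$TARGET_DIR" --rw=read --bs=1M \
    --size=2G --numjobs=4 --direct=1 --group_reporting
# Small random reads (typical of many-small-files workloads)
fio --name=randread --directory="$TARGET_DIR" --rw=randread --bs=4k \
    --size=1G --numjobs=4 --direct=1 --group_reporting
Compare the measured throughput and IOPS against the filesystem's provisioned limits in CloudWatch to decide whether to scale the filesystem or restructure the I/O.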
Memory Issues
"Cannot Allocate Memory" Error at os.fork()
Orchestrator: Common (Slurm, EKS)
Issue: Training fails with "OSError: [Errno 12] Cannot allocate memory" during os.fork() operations
Common Symptoms:
- PyTorch DataLoader with multiple workers fails when forking processes
- The error occurs specifically at the os.fork() call
- "Failed to register memory" errors during EFA initialization
- Segmentation faults during NCCL operations
- Training crashes when using EFA with multi-process data loading
Common Causes:
- Insufficient shared memory (/dev/shm) for forked processes
- Huge pages not configured properly, causing EFA memory registration to fail during fork
- Too many DataLoader workers attempting to fork
- Large memory footprint in parent process before fork
Resolution Steps:
1. Set the FI_EFA_USE_HUGE_PAGE=0 environment variable:
   export FI_EFA_USE_HUGE_PAGE=0
   Add it to the job script, the container environment, or /etc/environment for a persistent setting (a job-script sketch follows this list).
2. Increase the shared memory size for containers:
   # For Docker containers
   docker run --shm-size=8g ...
   # For Kubernetes pods, add to the pod spec:
   volumes:
   - name: dshm
     emptyDir:
       medium: Memory
       sizeLimit: 8Gi
3. Reduce the number of DataLoader workers: num_workers=4 instead of higher values
4. Reduce the batch size to lower memory pressure
5. Use persistent_workers=True to avoid recreating workers
6. Set pin_memory=False if not needed
7. Check available memory with free -h and /dev/shm usage with df -h /dev/shm
8. Verify the huge pages configuration: cat /proc/meminfo | grep Huge
9. If huge pages are needed for other workloads, configure them properly:
   # Check current huge pages
   cat /proc/sys/vm/nr_hugepages
   # Set huge pages (requires root)
   echo 1024 | sudo tee /proc/sys/vm/nr_hugepages
   Only use FI_EFA_USE_HUGE_PAGE=1 if huge pages are properly configured.
Storage Management
Root Volume Exhausted - How to Expand Storage
Orchestrator: Common (Slurm, EKS)
Issue: Running out of disk space on the root volume, need more storage capacity
Important to Know: You cannot configure the size of the primary EBS root volume in HyperPod - it is fixed at 100GB. This applies to all HyperPod clusters, and there is no way to change this size even when creating a new cluster.
Available Storage Options:
HyperPod provides alternative storage locations that you should use instead of the root volume:
1. Secondary EBS Volume (configurable per instance group)
   - Mount point: /opt/sagemaker
   - Size is configurable for each instance group
   - Can be configured when creating new instance groups (even after cluster creation)
2. NVMe Instance Storage (available on large instance types)
   - Mount point: /opt/dlami/nvme
   - High-performance local storage
   - Available on instance types like p4d, p5, etc.
3. FSx for Lustre Filesystem
   - Shared across all cluster nodes
   - High-performance parallel filesystem
   - Persistent storage shared across all nodes
4. FSx for OpenZFS Filesystem
   - Shared across all cluster nodes
   - High-performance filesystem with snapshot and cloning capabilities
   - Persistent storage shared across all nodes
5. Amazon S3
   - Object storage for large datasets
   - Fully persistent and durable
Default Configuration:
The default HyperPod lifecycle scripts automatically configure container runtimes to use alternative storage:
- HyperPod Slurm: Docker and containerd are configured to use /opt/sagemaker or /opt/dlami/nvme
- HyperPod EKS: containerd and kubelet are configured to use /opt/sagemaker or /opt/dlami/nvme
This prevents container images and layers from filling up the root volume.
Resolution Steps:
1. Check current disk usage:
   # Check all mounted filesystems
   df -h
   # Identify what's consuming space on the root volume
   sudo du -h --max-depth=1 / | sort -hr | head -20
2. Redirect application data to the secondary EBS volume:
   # Use /opt/sagemaker for application data
   export APP_DATA_DIR=/opt/sagemaker/my-app-data
   mkdir -p $APP_DATA_DIR
   # Redirect logs
   export LOG_DIR=/opt/sagemaker/logs
3. Use NVMe storage for temporary/scratch data:
   # Use /opt/dlami/nvme for temporary files
   export TMPDIR=/opt/dlami/nvme/tmp
   mkdir -p $TMPDIR
   # Redirect cache directories
   export TORCH_HOME=/opt/dlami/nvme/torch_cache
   export HF_HOME=/opt/dlami/nvme/huggingface_cache
4. Configure training scripts to use alternative storage:
   # In your training script
   checkpoint_dir = "/opt/sagemaker/checkpoints"
   cache_dir = "/opt/dlami/nvme/cache"
5. Clean up the root volume if it is already full:
   # Remove old logs
   sudo rm -f /var/log/*.log.* /var/log/*/*.gz
   # Clean the package manager cache
   sudo apt-get clean     # For Slurm (Ubuntu)
   sudo yum clean all     # For EKS (Amazon Linux)
   # Remove old container images (if applicable)
   docker system prune -a
6. For Kubernetes pods, configure volume mounts:
   volumes:
   - name: secondary-ebs
     hostPath:
       path: /opt/sagemaker
   - name: nvme-storage
     hostPath:
       path: /opt/dlami/nvme
   volumeMounts:
   - name: secondary-ebs
     mountPath: /workspace
   - name: nvme-storage
     mountPath: /tmp
Best Practices:
- Plan ahead: Configure secondary EBS volume size appropriately during cluster creation
- Use appropriate storage:
  - Persistent data → secondary EBS or FSx
  - Temporary data → NVMe storage
  - Large datasets → FSx or S3
- Monitor disk usage: set up CloudWatch alarms for disk space
- Avoid the root volume: never save large files or datasets to the root volume
- Container images: ensure the container runtime uses /opt/sagemaker or /opt/dlami/nvme
- Environment variables: point cache directories to alternative storage:
  export TORCH_HOME=/opt/sagemaker/torch_cache
  export HF_HOME=/opt/sagemaker/huggingface_cache
  export TRANSFORMERS_CACHE=/opt/sagemaker/transformers_cache
Prevention:
When creating a HyperPod cluster or adding instance groups, configure the secondary EBS volume size based on your needs:
- Size can be configured differently for each instance group
- New instance groups can be added after cluster creation with appropriate storage
- Consider the size of container images you'll use
- Account for logs, checkpoints, and temporary files
- Add buffer for unexpected growth (recommend 2-3x your estimated needs)
Utilities and How-To
How to Identify Instance ID from Slurm Node Name
Orchestrator: Slurm
Issue: Need to find the EC2 instance ID (e.g., i-abcd12345678) from a Slurm node name (e.g., ip-10-1-123-45)
Background:
On HyperPod Slurm clusters, nodes are named using their private IP addresses in the format ip-10-1-123-45. However, many AWS operations (SSM sessions, node replacement, CloudWatch logs) require the EC2 instance ID. This guide shows how to map between node names and instance IDs.
Resolution Steps:
Option 1: Query resource_config.json on Head Node
On the head node, the resource configuration file contains the mapping between IP addresses and instance IDs:
# Extract the IP address from the node name
# Example: ip-10-1-123-45 -> 10.1.123.45
NODE_NAME="ip-10-1-123-45"
IP_ADDRESS=$(echo $NODE_NAME | sed 's/ip-//; s/-/./g')
# Search for the instance ID in the resource config
sudo cat /opt/ml/config/resource_config.json | jq . | grep -A 3 "$IP_ADDRESS"
This will show the instance details including the instance ID.
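If you prefer a direct lookup, the following jq one-liner extracts the instance ID. It assumes the instances in resource_config.json carry CustomerIpAddress and InstanceId fields; inspect the file first, since the field names may differ between HyperPod versions.
IP_ADDRESS="10.1.123.45"
sudo jq -r --arg ip "$IP_ADDRESS" \
  '.InstanceGroups[].Instances[] | select(.CustomerIpAddress == $ip) | .InstanceId' \
  /opt/ml/config/resource_config.json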
Option 2: Use HyperPod Service APIs
Use the HyperPod list-cluster-nodes and describe-cluster-node APIs to get node information:
# List all nodes in the cluster
aws sagemaker list-cluster-nodes --cluster-name <cluster-name>
# Describe a specific node
aws sagemaker describe-cluster-node \
--cluster-name <cluster-name> \
--node-id <instance-id>
Recommended Tool:
For easier lookup, use the dump_cluster_nodes_info.py tool from the awsome-distributed-training repository:
- Repository: https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/tools/dump_cluster_nodes_info.py
- This tool dumps all HyperPod node information into a CSV file
- You can easily lookup instance IDs from IP addresses or node names
- The CSV includes: instance ID, private IP, node name, instance type, availability zone, and status
Usage Example:
# Download the tool
wget https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/tools/dump_cluster_nodes_info.py
# Run it to generate CSV
python3 dump_cluster_nodes_info.py --cluster-name <cluster-name>
# This creates a CSV file you can search or open in a spreadsheet
cat cluster_nodes_info.csv | grep "10.1.123.45"
Getting Help
Collecting Diagnostic Data for Issue Reporting
Orchestrator: Common (Slurm, EKS)
When reporting issues to AWS Support, providing comprehensive diagnostic data helps expedite troubleshooting and resolution.
Recommended Tool:
Use the hyperpod_issue_report tool to automatically collect relevant diagnostic information from your HyperPod cluster:
- Repository: https://github.com/shimomut/sagemaker-solutions/tree/main/hyperpod_issue_report
- Follow the instructions in the README for installation and usage
If you continue to experience issues:
- Check CloudWatch Logs: Most services log detailed information to CloudWatch
- Review CloudFormation Events: Stack events provide deployment timeline and errors
- AWS Support: Open a support case with relevant logs and error messages
- GitHub Issues: Report bugs or request features in the project repository