Testing Resiliency with HyperPod Slurm
This guide demonstrates how to test and validate the resiliency features of SageMaker HyperPod when using Slurm as the orchestrator. You'll learn how to submit resilient training jobs, inject failures, monitor cluster recovery, and manually replace nodes.
Test Case Overview
In this example, we'll:
- Submit a training job using Megatron-LM with checkpointing enabled
- Inject a GPU ECC error using DCGM to simulate a hardware failure
- Monitor the cluster's automatic recovery process
- Observe job resumption from the last checkpoint
Before proceeding, ensure you've completed the Megatron-LM setup for your HyperPod Slurm cluster.
1. Submit a Resilient Training Job
Megatron-LM supports checkpoint parameters that enable automatic job recovery:
| Parameter | Value | Description |
|---|---|---|
| `--save` | /fsx/checkpoints | Output directory for saving checkpoints |
| `--save-interval` | 1 | Number of iterations between checkpoint saves |
| `--load` | /fsx/checkpoints | Directory containing the model checkpoint to load |
Configure the Training Script
- Navigate to the Megatron-LM example directory in the awsome-distributed-training repository:
cd ~/awsome-distributed-training/3.test_cases/megatron/megatron-lm/slurm/gpt3
- Modify your `2.distributed-training.sbatch` script to include auto-resume and checkpointing:
# Add auto-resume flag to srun
srun ${AUTO_RESUME} -l "${ARGS[@]}" bash -c '
export MASTER_ADDR=$SLURM_LAUNCH_NODE_IPADDR;
export MASTER_PORT=6000;
export WORLD_SIZE=$SLURM_NTASKS;
export RANK=$SLURM_PROCID;
export LOCAL_RANK=$SLURM_LOCALID;
$NSYS_PROFILING \
python3 -u /workspace/Megatron-LM/pretrain_gpt.py \
$NSYS_PROFILING_MEGATRON_ARGS \
--save /fsx/checkpoints \
--save-interval 1 \
--load /fsx/checkpoints \
--num-layers $NUM_LAYERS \
--hidden-size $HIDDEN_SIZE \
--num-attention-heads $NUM_ATTENTION_HEADS \
--seq-length $SEQ_LENGTH \
--max-position-embeddings $MAX_POSITION_EMBEDDINGS \
--micro-batch-size $MICRO_BATCH_SIZE \
--global-batch-size $GLOBAL_BATCH_SIZE \
--tensor-model-parallel-size $TENSOR_PARALLEL \
--pipeline-model-parallel-size $PIPELINE_PARALLEL \
--train-samples 146484375 \
--lr-decay-samples 126953125 \
--lr-warmup-samples 183105 \
--lr 6.0e-5 \
--min-lr 6.0e-6 \
--lr-decay-style cosine \
--log-interval 1 \
--eval-iters 40 \
--eval-interval 1000 \
--data-path ${DATA_PATH}/gpt2/my-gpt2_text_document \
--vocab-file ${DATA_PATH}/gpt2/gpt2-vocab.json \
--merge-file ${DATA_PATH}/gpt2/gpt2-merges.txt \
--split 98,2,0 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--fp16 \
--recompute-activations '
Key additions:
- `${AUTO_RESUME}` flag after srun enables automatic job resumption (see the sketch after this list for how the variable is typically defined)
- `--save /fsx/checkpoints` specifies the checkpoint output directory
- `--save-interval 1` saves a checkpoint after each iteration (use higher values for production)
- `--load /fsx/checkpoints` enables loading from the last checkpoint
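The `${AUTO_RESUME}` variable must be defined earlier in the sbatch script. A minimal sketch of one common way to set it, assuming the HyperPod agent directory `/opt/sagemaker_cluster` exists on cluster nodes (verify against your own script):
# Enable srun auto-resume only when running on a HyperPod cluster
AUTO_RESUME=""
if [ -d "/opt/sagemaker_cluster" ]; then
    echo "Detected HyperPod cluster: enabling --auto-resume=1"
    AUTO_RESUME="--auto-resume=1"
fi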
Submit the Job
- Submit the training job:
sbatch 2.distributed-training.sbatch
- Monitor job progress by tailing the log file:
tail -f slurm-<job-id>.log
You should see output indicating successful checkpointing:
1: iteration 1/ 508626 | consumed samples: 288 | elapsed time per iteration (ms): 440352.6 | learning rate: 0.000E+00 | global batch size: 288 | loss scale: 4294967296.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
0: saving checkpoint at iteration 1 to /fsx/checkpoints
0: successfully saved checkpoint at iteration 1 to /fsx/checkpoints
1: (min, max) time across ranks (ms):
1: save-checkpoint ................................: (81611.24, 81611.82)
- Verify checkpoint creation:
ls -lt /fsx/checkpoints/
Expected output:
total 74
-rw-rw-r-- 1 ubuntu ubuntu 1 Dec 9 00:21 latest_checkpointed_iteration.txt
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec 9 00:20 iter_0000002
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec 9 00:11 iter_0000001
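To confirm which iteration a resumed job will load, you can inspect the `latest_checkpointed_iteration.txt` file that Megatron-LM writes alongside the checkpoint directories:
# Prints the iteration number the next run will resume from, e.g. 2
cat /fsx/checkpoints/latest_checkpointed_iteration.txt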
2. Inject Hardware Failure
Now we'll simulate a hardware failure to test the cluster's resiliency.
Identify Target Node
- Check which nodes your job is running on:
squeue
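If you want the individual hostnames for your job (for example, to pick a worker node to target), you can expand the compressed node list; `<job-id>` below is a placeholder for your job's ID:
# Expand the job's NodeList into one hostname per line
scontrol show hostnames "$(squeue -h -j <job-id> -o '%N')"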
Inject ECC Error
- SSH into one of the worker nodes (not the first node); substitute the hostname with one of the nodes listed by squeue:
ssh ip-10-1-0-16
- Inject an ECC Error using DCGM:
dcgmi test --inject --gpuid 0 -f 319 -v 4
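Optionally, you can read the same field back to verify the injection is visible to DCGM before the health agent reacts (field 319 is the volatile double-bit ECC error counter; this assumes the DCGM host engine is running):
# Read field 319 (ECC double-bit volatile errors) once for GPU 0
dcgmi dmon -i 0 -e 319 -c 1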
Simulate Process Failure
- Kill the training process to simulate job failure:
# Find the Python training process
ps -aux | grep python
# Kill the process (replace <PID> with actual process ID)
kill -9 <PID>
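Equivalently, assuming the training entry point is `pretrain_gpt.py` as in the script above, you can match and kill the process in one step:
# Kill all processes whose command line matches the training script
pkill -9 -f pretrain_gpt.py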
3. Monitor Cluster Recovery
Monitor Slurm Controller Logs
To observe the node replacement process, monitor the Slurm controller logs on the controller (head) node:
tail -f /var/log/slurm/slurmctld.log
You'll see entries indicating node failure and replacement:
[2024-03-06T16:26:50.915] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2024-03-08T00:15:02.047] error: _find_node_record: lookup failure for node "ip-10-1-57-141"
[2024-03-08T00:15:02.047] error: update_node: node ip-10-1-57-141 does not exist
[2024-03-08T00:15:02.047] _slurm_rpc_update_node for ip-10-1-57-141: Invalid node name specified
[2024-03-08T00:15:18.986] update_node: node ip-10-1-73-87 reason set to: Action:Replace
[2024-03-08T00:15:18.986] update_node: node ip-10-1-73-87 state set to FAIL
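If the log is noisy, a quick way to surface only the replacement-related entries is to filter for the strings shown above (a convenience, assuming the default log path):
# Show recent node update events related to automatic replacement
grep -E "Action:Replace|state set to FAIL" /var/log/slurm/slurmctld.log | tail -20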
Check Node Status
Monitor node status changes:
sinfo
You'll see the node transition through different states:
- `failg` (failing, while the job is still running)
- `fail` (failed, after job termination)
- Eventually replaced with a new node
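To follow these transitions without re-running the command manually, you can poll the cluster state every 30 seconds (purely a convenience):
# Refresh partition and node state every 30 seconds
watch -n 30 sinfo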
Monitor Job Queue
Check job status:
squeue
The job should automatically resume once the replacement node is available and healthy.
4. Manual Node Replacement
You can also manually trigger node replacement when needed.
Replace a Specific Node
To manually replace a node, use the `scontrol` command:
sudo scontrol update node=ip-10-1-57-141 state=down reason="Action:Replace"
Replace `ip-10-1-57-141` with the actual hostname of the node you want to replace.
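You can confirm that the state and reason were applied with `scontrol show node` (the hostname is the same placeholder as above):
# Inspect the node's current state and the reason string that triggers replacement
scontrol show node ip-10-1-57-141 | grep -E "State|Reason"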
Monitor Replacement Process
- Check immediate node status change:
sinfo
The node will show in the `failg` (failing) state:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dev* up infinite 1 failg ip-10-1-57-141
dev* up infinite 1 alloc ip-10-1-69-138
- While jobs are running, the node remains active:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9 dev megatron ubuntu R 19:26 2 ip-10-1-57-141,ip-10-1-69-138
- After job termination, the node enters the `fail` state and the instance is terminated:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dev* up infinite 1 fail ip-10-1-57-141
dev* up infinite 1 idle ip-10-1-69-138
Monitor in SageMaker Console
You can monitor the replacement process in the SageMaker Console where you'll see the new node being provisioned.
Post-Replacement Setup
When the replacement node is ready, you may need to remap the ubuntu user's home directory to the shared file system (usermod requires root privileges, hence sudo):
# Remap home directory for the replaced node
srun -w ip-10-1-57-141 sudo usermod -d /fsx/ubuntu ubuntu
# SSH into the new node
ssh ip-10-1-57-141
You may encounter a host key verification error because the new node uses the same hostname. Remove the old key:
ssh-keygen -f "/fsx/ubuntu/.ssh/known_hosts" -R "ip-10-1-57-141"
Then SSH again to accept the new host key.
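Once you can log in, it is worth confirming that the replacement node mounts the shared file system before resuming work; a minimal check, assuming FSx is mounted at `/fsx`:
# Verify the shared FSx mount is visible from the replaced node
srun -w ip-10-1-57-141 df -h /fsx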
Best Practices
- Checkpoint Frequency: Set appropriate `--save-interval` values based on your training duration and checkpoint overhead
- Shared Storage: Ensure checkpoints are saved to shared storage (FSx) accessible by all nodes
- Job Monitoring: Regularly monitor job logs and node status during long-running training jobs
- Resource Planning: Account for temporary capacity reduction during node replacement
- Testing: Regularly test resiliency features in non-production environments
Troubleshooting
Job Not Resuming
- Verify checkpoint files exist in the specified directory
- Check Slurm logs for scheduling issues
- Ensure replacement nodes have proper access to shared storage
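If accounting is enabled on your cluster, Slurm's accounting database can also show whether the job was requeued and where it ran; `<job-id>` is a placeholder:
# Show job steps, their states, and the nodes they ran on
sacct -j <job-id> --format=JobID,JobName,State,ExitCode,NodeList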
Node Replacement Delays
- Check AWS service limits and capacity availability
- Monitor CloudWatch logs for detailed error messages
- Verify IAM permissions for node replacement operations
Checkpoint Corruption
- Implement checkpoint validation in your training script
- Use multiple checkpoint directories for redundancy
- Monitor storage health and capacity
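As a lightweight example of checkpoint validation, the sketch below (hypothetical, not part of the Megatron-LM scripts) checks that the iteration recorded in `latest_checkpointed_iteration.txt` has a matching checkpoint directory:
# Hypothetical sanity check: warn if the recorded latest iteration has no directory
CKPT_DIR=/fsx/checkpoints
LATEST=$(cat "${CKPT_DIR}/latest_checkpointed_iteration.txt")
printf -v ITER_DIR "iter_%07d" "${LATEST}"
if [ -d "${CKPT_DIR}/${ITER_DIR}" ]; then
    echo "OK: ${ITER_DIR} exists"
else
    echo "WARNING: latest_checkpointed_iteration.txt points to missing ${ITER_DIR}" >&2
fi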