Inject an Error

Submit the Training Job

Start by submitting the training job on a fresh run:

sbatch train.sbatch

Tail the logs for the job:

tail -f logs/picotron_$(squeue -h -u $USER -o "%i" | head -1).out

Let training run for a few steps until at least one checkpoint has been saved; the resume step later on this page needs a checkpoint to load.
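
Rather than watching the log and guessing, you could poll the checkpoint directory. The following is a minimal sketch and not part of the original workflow: `wait_for_checkpoints` is a hypothetical helper, and the default threshold of two entries and the 10-second polling interval are assumptions. The only thing taken from this page is the `checkpoints/` directory name used in the resume step.

```shell
# Hypothetical helper (not part of picotron or Slurm): block until DIR
# contains at least MIN entries, polling every INTERVAL seconds.
wait_for_checkpoints() {
  dir="$1"
  min="${2:-2}"        # assumed default: wait for two checkpoints
  interval="${3:-10}"  # assumed default: poll every 10 seconds
  while :; do
    # Count entries; a missing directory counts as zero.
    count=$(ls -1 "$dir" 2>/dev/null | wc -l | tr -d ' ')
    if [ "$count" -ge "$min" ]; then
      echo "$count"
      return 0
    fi
    sleep "$interval"
  done
}

# Example (run from the directory that holds checkpoints/):
#   wait_for_checkpoints checkpoints 2
```

The function prints the number of entries it found and returns as soon as the threshold is met, so it can gate the error-injection steps in a script.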

Inject the Error

  1. Run squeue to see which hosts your job is running on:
squeue
  2. SSH into one of the hosts that is not the first node in the list (the hostname below is an example):
ssh ip-10-1-0-16
  3. Inject an ECC error on GPU 0:
dcgmi test --inject --gpuid 0 -f 319 -v 4
  4. Kill the training process to simulate a job failure:
kill -9 $(ps -aux | grep "python3 -u /picotron/train.py" | grep -v grep | awk '{print $2}')
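
Steps 1 and 2 above can be scripted instead of reading the node list by eye. This is a sketch under assumptions: `pick_non_first_host` is a hypothetical helper, and the commented usage assumes the job spans at least two nodes and that the first job listed by squeue for your user is the training job.

```shell
# Hypothetical helper: given one hostname per line on stdin, print the
# second one, i.e. a host that is not the first node of the job.
pick_non_first_host() {
  sed -n '2p'
}

# Example on a Slurm login node (assumes the job spans at least two nodes):
#   job_id=$(squeue -h -u "$USER" -o "%i" | head -1)
#   nodelist=$(squeue -h -j "$job_id" -o "%N")
#   target=$(scontrol show hostnames "$nodelist" | pick_non_first_host)
#   ssh "$target"
```

`scontrol show hostnames` expands a compressed Slurm node list into one hostname per line, which is the input the helper expects. If the job runs on a single node, the helper prints nothing, so check that `target` is non-empty before using it.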

Resume from Checkpoint

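Now that a node failure has been triggered, select a saved checkpoint and update load_path to point at it. Note that the command below picks the second most recent entry in checkpoints/ (tail -n 2 | head -n 1) rather than the newest one, presumably so that a checkpoint interrupted by the kill is not loaded: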

previous_checkpoint=$(ls -1 checkpoints/ | sort -n | tail -n 2 | head -n 1)
jq --arg path "$previous_checkpoint" \
'.checkpoint.load_path = .checkpoint.save_dir + "/" + $path' \
conf/llama-1B-tp2/config.json > tmp.json && mv tmp.json conf/llama-1B-tp2/config.json
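
To see what the selection pipeline above returns before touching the real config, you can reproduce it on a throwaway directory. This is an illustrative sketch: the checkpoint names 100, 200, and 300 are hypothetical, and it assumes the real checkpoint entries sort correctly with `sort -n`.

```shell
# Throwaway directory with three hypothetical checkpoint names.
demo=$(mktemp -d)
mkdir "$demo/100" "$demo/200" "$demo/300"

# Same pipeline as above: numeric sort, keep the last two, take the first.
selected=$(ls -1 "$demo" | sort -n | tail -n 2 | head -n 1)
echo "selected: $selected"

# With a single checkpoint, the pipeline falls back to that one entry.
rmdir "$demo/200" "$demo/300"
only=$(ls -1 "$demo" | sort -n | tail -n 2 | head -n 1)
echo "only: $only"

rm -rf "$demo"
```

With three entries the pipeline selects 200, the one before the newest; with a single entry it selects that entry. The jq filter then writes the selected name after the existing `.checkpoint.save_dir` value, joined by a slash, into `.checkpoint.load_path`.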

Verify that load_path now points at the selected checkpoint:

cat conf/llama-1B-tp2/config.json