Inject an Error

Submit the Training Job

Start a fresh run by submitting the training job:

sbatch train.sbatch

Tail the logs for the job:

tail -f logs/picotron_$(squeue -h -u $USER -o "%i" | head -1).out

Let a few training steps complete so that checkpoints are written to checkpoints/ before you inject the failure.
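
Instead of watching the log and guessing, you can poll the checkpoint directory until entries appear. The following is a minimal sketch, assuming each checkpoint shows up as one entry directly under checkpoints/ (the directory listed in the resume step). wait_for_checkpoints is a hypothetical helper, not part of picotron.

```shell
# Hypothetical helper: block until DIR holds at least MIN entries.
# Assumes each checkpoint appears as one entry directly under DIR.
wait_for_checkpoints() {
  local dir="$1" min="${2:-2}" interval="${3:-10}" count
  while :; do
    # tr strips the padding that some wc implementations add.
    count=$(ls -1 "$dir" 2>/dev/null | wc -l | tr -d ' ')
    if [ "$count" -ge "$min" ]; then
      echo "found $count checkpoint(s) in $dir"
      return 0
    fi
    sleep "$interval"
  done
}

# Example: wait_for_checkpoints checkpoints 2 10
```

Waiting for two entries rather than one leaves an older checkpoint available in case the newest one is interrupted.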

Inject the Error

Run squeue to see which hosts your job is running on (the NODELIST column):

squeue

SSH into one of the job's hosts other than the first node in the list (replace the example hostname with one of yours):

ssh ip-10-1-0-16
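
The node to target comes from your own job's node list. Here is a small sketch for picking one, assuming Slurm's squeue -o "%N" and scontrol show hostnames are available where you run it. pick_non_first_host is a hypothetical helper that reads one hostname per line and prints the second.

```shell
# Hypothetical helper: read hostnames (one per line) on stdin and print
# the second one, i.e. a node that is not first in the job's node list.
# Exits non-zero if fewer than two hostnames are supplied.
pick_non_first_host() {
  awk 'NR == 2 { print; found = 1; exit } END { if (!found) exit 1 }'
}

# Usage on a Slurm cluster (scontrol expands a compressed node list
# into one hostname per line):
#   nodelist=$(squeue -h -u "$USER" -o "%N" | head -1)
#   target=$(scontrol show hostnames "$nodelist" | pick_non_first_host)
#   ssh "$target"
```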

Inject an ECC error on GPU 0 by setting DCGM field 319 to the value 4:

dcgmi test --inject --gpuid 0 -f 319 -v 4

While still on that node, kill the training process to simulate a job failure:

kill -9 $(ps aux | grep "python3 -u /picotron/train.py" | grep -v grep | awk '{print $2}')

Resume from Checkpoint

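Now that a node failure has been triggered, choose a checkpoint to resume from and update load_path. The command below picks the second-newest entry in checkpoints/ rather than the newest, a conservative choice in case the newest checkpoint was still being written when the process was killed: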

previous_checkpoint=$(ls -1 checkpoints/ | sort -n | tail -n 2 | head -n 1)
jq --arg path "$previous_checkpoint" \
  '.checkpoint.load_path = .checkpoint.save_dir + "/" + $path' \
  conf/llama-1B-tp2/config.json > tmp.json && mv tmp.json conf/llama-1B-tp2/config.json
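
The selection pipeline above returns an empty string when checkpoints/ is empty, which would leave load_path pointing at the save directory itself. As a hedged sketch, a small wrapper can make that case fail loudly. select_resume_checkpoint is a hypothetical helper, and it assumes checkpoint entries have numeric names so that sort -n orders them by training step.

```shell
# Hypothetical helper: print the second-newest entry in DIR (or the only
# entry if there is just one); fail if DIR holds no entries.
# Assumes entry names sort numerically by training step.
select_resume_checkpoint() {
  local dir="$1" picked
  picked=$(ls -1 "$dir" 2>/dev/null | sort -n | tail -n 2 | head -n 1)
  if [ -z "$picked" ]; then
    echo "no checkpoints found in $dir" >&2
    return 1
  fi
  echo "$picked"
}

# Example:
#   previous_checkpoint=$(select_resume_checkpoint checkpoints) || exit 1
```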

Verify that load_path now points at the selected checkpoint:

cat conf/llama-1B-tp2/config.json
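
Rather than scanning the whole file, the check can be narrowed to the one field that changed. This is a minimal sketch, assuming jq is installed (it is already used above) and that load_path resolves from the directory where you run the check. check_load_path is a hypothetical helper.

```shell
# Hypothetical helper: read .checkpoint.load_path from CONFIG and confirm
# that it is set and that the path exists on disk.
check_load_path() {
  local config="$1" load_path
  load_path=$(jq -r '.checkpoint.load_path // empty' "$config") || return 1
  if [ -z "$load_path" ]; then
    echo "load_path is not set in $config" >&2
    return 1
  fi
  if [ ! -e "$load_path" ]; then
    echo "load_path does not exist: $load_path" >&2
    return 1
  fi
  echo "will resume from $load_path"
}

# Example: check_load_path conf/llama-1B-tp2/config.json
```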
