Diagnose GPU Failures
To diagnose a node with a bad GPU on SageMaker HyperPod (ip-10-1-69-242 in this example), do the following:
- Run the NVIDIA GPU reset command:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
- If the reset doesn't succeed, generate a bug report:
srun -w ip-10-1-69-242 sudo nvidia-bug-report.sh
- Grab the instance ID:
srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "
- Grab the output of nvidia-bug-report.sh (nvidia-bug-report.log.gz by default) and replace that instance:
sudo scontrol update node=ip-10-1-69-242 state=down reason="Action:Replace"
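
The steps above could be strung together roughly as follows. This is only a sketch: NODE, GPU_INDEX, DRY_RUN, run, and diagnose_gpu are hypothetical names made up for this example, and the sudo in front of nvidia-bug-report.sh is an assumption (the script usually insists on root). By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the triage steps above for a single node.
# NODE, GPU_INDEX, DRY_RUN, run, and diagnose_gpu are hypothetical names
# introduced for this example; they are not part of HyperPod or Slurm.
set -u

NODE="${NODE:-ip-10-1-69-242}"
GPU_INDEX="${GPU_INDEX:-0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  # Log the command to stderr, then execute it unless this is a dry run.
  echo "+ $*" >&2
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

diagnose_gpu() {
  # Step 1: try to reset the GPU; in a real run, stop here on success.
  if run srun -w "$NODE" sudo nvidia-smi --gpu-reset -i "$GPU_INDEX" \
      && [ "$DRY_RUN" = "0" ]; then
    echo "GPU reset succeeded on $NODE"
    return 0
  fi

  # Step 2: the reset failed, so generate a bug report.
  # Assumption: sudo is used because the script usually insists on root.
  run srun -w "$NODE" sudo nvidia-bug-report.sh

  # Step 3: read the instance ID (tr strips spaces from the tag).
  instance_id="$(run srun -w "$NODE" cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d ' ')"
  echo "Instance ID: ${instance_id:-not captured}"

  # Step 4: mark the node down so the instance is replaced.
  run sudo scontrol update node="$NODE" state=down reason="Action:Replace"
}

diagnose_gpu
```

Setting DRY_RUN=0 (and NODE to the affected host) executes the commands instead of printing them. Keeping the dry run as the default is a deliberate choice here, since the last step takes the node out of service.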