Diagnose GPU Failures

To diagnose a node with a bad GPU on SageMaker HyperPod (ip-10-1-69-242 in this example), do the following:

  1. Try resetting the GPU with nvidia-smi (GPU index 0 here):
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
  2. If the reset doesn't succeed, generate a bug report:
srun -w ip-10-1-69-242 nvidia-bug-report.sh
  3. Grab the instance ID:
srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "
  4. Save the output of nvidia-bug-report.sh (nvidia-bug-report.log.gz by default), then mark the node for replacement:
sudo scontrol update node=ip-10-1-69-242 state=down reason="Action:Replace"
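
The steps above could be wrapped in a small helper, sketched below. This is only an illustration: the function names `run` and `diagnose_gpu` and the `DRY_RUN` variable are hypothetical and not part of HyperPod or Slurm. With `DRY_RUN=1`, each command is printed to stderr instead of being executed.

```shell
# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*" >&2
  else
    "$@"
  fi
}

# Hypothetical wrapper around the manual steps.
# Usage: diagnose_gpu <node> [gpu_index]
diagnose_gpu() {
  local node="$1" gpu="${2:-0}"

  # Step 1: try a GPU reset. In a real run, a successful reset ends the procedure.
  if run srun -w "$node" sudo nvidia-smi --gpu-reset -i "$gpu" \
      && [ "${DRY_RUN:-0}" != "1" ]; then
    echo "GPU $gpu on $node was reset; no replacement requested."
    return 0
  fi

  # Step 2: the reset failed, so collect a bug report on the node.
  run srun -w "$node" nvidia-bug-report.sh

  # Step 3: print the instance ID, with spaces stripped.
  run srun -w "$node" cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "

  # Step 4: mark the node down so that it is replaced.
  run sudo scontrol update node="$node" state=down reason="Action:Replace"
}
```

For example, `DRY_RUN=1 diagnose_gpu ip-10-1-69-242 0` lists the four commands without contacting the cluster, which is a cheap way to check the node name before running the function for real.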