Get to Know Your Cluster

Reference Documentation

For comprehensive Slurm command reference and cluster management, see Slurm Basics.

Now that you've created and set up the cluster, let's go through some of the commands you'll use.

SLURM

SLURM from SchedMD is one of the batch schedulers that you can use in SageMaker HyperPod. For an overview of the SLURM commands, see the SLURM Quick Start User Guide.

Check Cluster Status

Running sinfo shows the partition we created:

sinfo

State   Description
idle    Instance is not running any jobs but is available
mix     Instance is partially allocated
alloc   Instance is completely allocated
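
To see at a glance how many instances are in each state, the sinfo output can be condensed. The following is a minimal sketch, not part of the workshop itself: `summarize_states` is a hypothetical helper name, and it assumes the compact state (`%t`) and node count (`%D`) format specifiers of sinfo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sums node counts per state.
# Expects "STATE COUNT" lines on stdin, e.g. from: sinfo -h -o "%t %D"
summarize_states() {
  awk '{ count[$1] += $2 } END { for (s in count) print s, count[s] }' | sort
}

# Only query Slurm when sinfo is actually installed (i.e. on the cluster).
if command -v sinfo >/dev/null 2>&1; then
  sinfo -h -o "%t %D" | summarize_states
fi
```

Because sinfo prints one line per partition and state, the helper adds up the counts so that each state appears once.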

Check Job Queue

squeue
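
On a busy cluster it can help to narrow the queue to your own jobs and count them by state. This is a sketch under assumptions: `count_state` is a hypothetical helper, and it relies on squeue's `-u` filter and the compact job state specifier (`%t`), where `R` means running and `PD` means pending.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts stdin lines whose first field equals the given state code.
count_state() {
  awk -v want="$1" '$1 == want { n++ } END { print n + 0 }'
}

# Only query Slurm when squeue is actually installed (i.e. on the cluster).
if command -v squeue >/dev/null 2>&1; then
  me="${USER:-$(id -un)}"
  echo "running: $(squeue -h -u "$me" -o "%t" | count_state R)"
  echo "pending: $(squeue -h -u "$me" -o "%t" | count_state PD)"
fi
```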

Shared Filesystems

A few volumes are shared by the head node and are mounted on the compute instances when they boot. Network-mounted filesystems, such as the FSx for Lustre filesystem at /fsx, show up in the output of df:

df -h
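
The full df listing also includes local disks and tmpfs entries. To pick out just the shared filesystems, one option is to filter on the filesystem type. The sketch below assumes a df that supports the `-T` (print type) option, as GNU df does; `network_mounts` is a hypothetical helper name, and it assumes mount points without spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `df -T`-style output on stdin and prints
# "mountpoint type" for Lustre and NFS filesystems only.
network_mounts() {
  awk 'NR > 1 && ($2 == "lustre" || $2 ~ /^nfs/) { print $NF, $2 }'
}

# On a cluster node this should list /fsx if the filesystem is mounted.
df -hT 2>/dev/null | network_mounts || true
```

If nothing is printed on a compute node, the shared filesystem is likely not mounted there yet.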

SSH to Compute Nodes

Let's SSH to the compute nodes for interactive testing:

  1. Make sure you're logged in as ubuntu:
whoami  # should show ubuntu
  2. Allocate an interactive node and SSH in:
salloc -N 1
ssh $(srun hostname)

When done, exit back to the head node:

exit  # exit the SSH session
exit  # release the salloc allocation
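
It is easy to lose track of whether the second exit has been issued and the node released. A minimal sketch for checking this from the head node follows. It assumes the default salloc behavior of starting a shell on the head node with `SLURM_JOB_ID` exported; `allocation_status` is a hypothetical helper name. Inside the SSH session on the compute node the variable is generally not set, so the check is only meaningful on the head node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reports whether this shell was started inside a
# Slurm allocation, based on the SLURM_JOB_ID environment variable.
allocation_status() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "inside allocation ${SLURM_JOB_ID}"
  else
    echo "no allocation"
  fi
}

allocation_status
```

If this still reports an allocation after leaving the SSH session, one more exit is needed to free the instance.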
Pro-tip

Update your bash prompt to show if you're on a CONTROLLER or WORKER node:

echo -e "\n# Show (CONTROLLER) or (WORKER) on the CLI prompt" >> ~/.bashrc
echo 'head_node_ip=$(sudo cat /opt/ml/config/resource_config.json | jq '"'"'.InstanceGroups[] | select(.Name == "controller-machine") | .Instances[0].CustomerIpAddress'"'"' | tr -d '"'"'"'"'"'"')' >> ~/.bashrc
echo 'if [ "$(hostname -I | awk '"'"'{print $1}'"'"')" = "$head_node_ip" ]; then PS1="(CONTROLLER) ${PS1}"; else PS1="(WORKER) ${PS1}"; fi' >> ~/.bashrc
source ~/.bashrc
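
The nested quoting in the echo commands above makes the logic hard to read. Stripped of the escaping, the lines appended to ~/.bashrc compare the node's first IP address with the controller's IP address and pick a label accordingly. The comparison alone can be sketched as a small function; `node_role` is a hypothetical name used here for illustration and is not what the commands above write to the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints CONTROLLER when the local IP matches the
# head node IP, WORKER otherwise.
node_role() {
  local local_ip="$1" head_ip="$2"
  if [ "$local_ip" = "$head_ip" ]; then
    echo "CONTROLLER"
  else
    echo "WORKER"
  fi
}

# Example use in a prompt, assuming both variables were set beforehand:
#   PS1="($(node_role "$local_ip" "$head_node_ip")) ${PS1}"
```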