Delete Cluster Nodes
The SageMaker BatchDeleteClusterNodes API allows you to delete specific nodes within a SageMaker HyperPod cluster. It accepts a cluster name and a list of node IDs.
Note: The following instructions apply to clusters created after 6/20/2024. If your cluster was created before this date, you will need to run aws sagemaker update-cluster-software before executing the steps below. See the release notes.
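For reference, a minimal invocation of the software update (substitute your own cluster name and region):
aws sagemaker update-cluster-software --cluster-name <YOUR_CLUSTER_NAME> --region <YOUR_REGION>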
The basic invocation of batch-delete-cluster-nodes looks like this (the node IDs are placeholders):
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" --node-ids '["i-111112222233333", "i-444445555566666"]'
Below are the recommended steps for using this API safely on a production HyperPod cluster without disrupting running jobs.
Prepare cluster for node deletion
Set node state to DOWN
Run the following commands on the compute/login node of your HyperPod cluster.
- List the current nodes and their status within the cluster:
sinfo
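The output will look something like the following (partition and node names here are illustrative):
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
dev*         up   infinite      4   idle ip-10-1-2-[3-6]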
- The following command will generate a list of all IDLE node names (HyperPod names Slurm nodes after their IP addresses), which can be passed to scontrol:
export IDLE_NODES_TO_TERMINATE=$(sinfo --noheader --state=idle -o "%N")
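You can echo the variable to confirm the Slurm hostlist it holds (the node names below are illustrative):
echo $IDLE_NODES_TO_TERMINATE
ip-10-1-2-[3-6]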
- It is best practice to set the nodes to the DOWN state in Slurm before terminating them. This ensures the Slurm scheduler will not allocate new jobs to these nodes. Run the command below to set all IDLE nodes in the cluster to DOWN.
The example below sets all IDLE nodes to DOWN. If you want to target specific nodes instead, you can run: sudo scontrol update NodeName=<Node_IP> State=Down Reason="Termination"
sudo scontrol update NodeName=$IDLE_NODES_TO_TERMINATE State=Down Reason="Termination"
- Confirm the nodes are set to DOWN in Slurm:
export NODES_SET_FOR_TERMINATION=$(sinfo -R --noheader -o "%N")
echo $NODES_SET_FOR_TERMINATION
- Broadcast a message to other users on the cluster:
sudo wall "The nodes $NODES_SET_FOR_TERMINATION have been set to DOWN and are scheduled for termination. Please take note"
Confirm the InstanceID of the DOWN nodes:
- Create and run the following script. The get-node-status.sh script retrieves the InstanceID for each of the cluster nodes by parsing the /opt/ml/config/resource_config.json file created by HyperPod.
Create the script:
cat <<'EOF' > get-node-status.sh
#!/bin/bash
# Print the header for the table
printf "%-20s %-20s %-30s %-10s %-40s\n" "IP Address" "Instance ID" "Instance Name" "Status" "Reason"
printf "%-20s %-20s %-30s %-10s %-40s\n" "--------------------" "--------------------" "------------------------------" "----------" "----------------------------------------"
# Extract details and check node status with scontrol
sudo cat /opt/ml/config/resource_config.json | jq -r '.InstanceGroups[].Instances[] | "\(.CustomerIpAddress) \(.InstanceId) \(.InstanceName)"' | while read -r ip instance_id instance_name; do
# Format IP address by replacing dots with dashes
formatted_ip="ip-${ip//./-}"
# Run scontrol to get the node status using the formatted IP; \S+ also captures compound states such as DOWN+DRAIN
node_status=$(scontrol show node "$formatted_ip" | grep -oP 'State=\K\S+')
# Get the reason from sinfo -R, match the formatted IP with the nodes
node_reason=$(sinfo -R --noheader | grep "$formatted_ip" | awk '{print $1}')
# Print each line in a formatted way
printf "%-20s %-20s %-30s %-10s %-40s\n" "$formatted_ip" "$instance_id" "$instance_name" "${node_status:-unknown}" "${node_reason:-unknown}"
done
EOF
- Run the script:
bash get-node-status.sh
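The output will resemble the following (values are illustrative; the instance IDs match the deletion example further below):
IP Address           Instance ID          Instance Name                  Status     Reason
-------------------- -------------------- ------------------------------ ---------- ----------------------------------------
ip-10-1-2-3          i-04486002ebdb59e0a  worker-group-1                 DOWN       Termination
ip-10-1-2-4          i-0b8bc2c52dc7b1fa2  worker-group-1                 DOWN       Termination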
Execute Node Deletion
To execute the batch-delete-cluster-nodes command, you will need to call the API from a development environment with the AWS CLI installed. You can use AWS CloudShell from within your AWS account, which comes with the AWS CLI pre-installed.
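Before calling the API, you can confirm that the CLI is using credentials for the account that owns the cluster:
aws sts get-caller-identity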
- Execute the batch-delete-cluster-nodes API:
# Confirm the cluster name
aws sagemaker list-clusters --region <YOUR_REGION>
# Confirm the cluster nodes
aws sagemaker list-cluster-nodes --cluster-name <CLUSTER_NAME> --region <YOUR_REGION>
# Delete the nodes, passing the instance IDs as a space-separated list
aws sagemaker batch-delete-cluster-nodes --cluster-name <YOUR_CLUSTER_NAME> --node-ids <INSTANCE_ID_1> <INSTANCE_ID_2> --region <YOUR_REGION>
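If you prefer to pull the instance IDs programmatically instead of copying them from the get-node-status.sh output, the following sketch filters the list-cluster-nodes response. The --query expression assumes the ClusterNodeSummaries response shape from the API reference; verify the field names against your CLI version, and only pass the IDs of nodes you have already set to DOWN in Slurm:
# Show each node's instance ID and HyperPod status (assumes the ClusterNodeSummaries response shape)
aws sagemaker list-cluster-nodes --cluster-name <YOUR_CLUSTER_NAME> --region <YOUR_REGION> --query 'ClusterNodeSummaries[].[InstanceId,InstanceStatus.Status]' --output table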
- When the batch-delete-cluster-nodes API has been executed successfully, you will see output similar to the following:
$ aws sagemaker batch-delete-cluster-nodes --cluster-name ml-cluster --node-ids i-04486002ebdb59e0a i-0b8bc2c52dc7b1fa2
{
"Successful": [
"i-04486002ebdb59e0a",
"i-0b8bc2c52dc7b1fa2"
]
}
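Because this is a batch operation, some nodes can fail to delete while others succeed. Below is a minimal sketch for surfacing failures, assuming the documented response shape in which a Failed list carries NodeId, Code, and Message for each node that could not be deleted:
aws sagemaker batch-delete-cluster-nodes --cluster-name <YOUR_CLUSTER_NAME> --node-ids <INSTANCE_IDS> --region <YOUR_REGION> > delete-response.json
# Print any nodes that could not be deleted (assumes Failed[] entries have NodeId, Code, and Message)
jq -r '.Failed[]? | "\(.NodeId): \(.Code) - \(.Message)"' delete-response.json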
If you are using Reserved Instances for your HyperPod cluster (via a negotiated PPA), you will continue to be billed for nodes that have been deleted from your cluster. If you no longer wish to be billed for the deleted instances, contact your AWS account/support team to notify them of the node termination and request that billing be stopped for the deleted nodes.