Easy Cluster Setup
Overview
This section provides a script that walks you through the following:
- Installing the right packages in your environment (e.g., AWS CLI)
- Installing the Lifecycle Scripts used by HyperPod nodes during bootstrapping
- Creating the cluster configuration, and uploading all your assets to Amazon S3 for cluster creation
The script guides you through the entire onboarding process so you can create a SageMaker HyperPod SLURM cluster by yourself.
Important Information
The script asks you questions to customize your HyperPod cluster. For example:
# In this example, you are asked to input the instance type for your cluster's worker group.
# The default value is ml.g5.8xlarge
Enter the instance type for your worker group [ml.g5.8xlarge]:
Since this is a workshop session, the default values are pre-configured for you. You may hit ENTER for every question asked by the script!
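To make the prompt behavior concrete, here is a minimal sketch of how a "question with a default" prompt can be written in bash. The function name `prompt_with_default` is hypothetical and the automation script's actual implementation may differ; the sketch only illustrates why hitting ENTER selects the value shown in brackets.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the automation script):
# ask a question and fall back to a default when the answer is empty.
prompt_with_default() {
  local question="$1" default="$2" answer
  # Print the prompt on stderr so that only the chosen value goes to stdout.
  printf '%s [%s]: ' "$question" "$default" >&2
  # Hitting ENTER (or reaching end of input) leaves $answer empty.
  IFS= read -r answer || true
  # Emit the typed answer, or the default if nothing was typed.
  printf '%s\n' "${answer:-$default}"
}

# Example call (commented out so that sourcing this file stays non-interactive):
# instance_type=$(prompt_with_default "Enter the instance type for your worker group" "ml.g5.8xlarge")
```

Typing a value and pressing ENTER returns that value; pressing ENTER alone returns the bracketed default.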
Setup Environment
Run the automation script from the CodeEditor terminal provided to you:
# Create a working directory and download the script
mkdir hyperpod && cd hyperpod
curl -O https://raw.githubusercontent.com/awslabs/awsome-distributed-ai/refs/heads/main/1.architectures/5.sagemaker-hyperpod/automate-smhp-slurm/automate-cluster-creation.sh
# Run the script
bash automate-cluster-creation.sh
After the script finishes, your working directory should look like this:
hyperpod/
|-- automate-cluster-creation.sh # This script!
|-- awsome-distributed-training/ # The script clones the repo with all the Lifecycle Scripts
|-- cluster-config.json # Cluster configuration generated by this script
|-- create_config.sh # The shell script used to write your environment variables
|-- env_vars # Your environment variables used to create the SMHP cluster
|-- provisioning_parameters.json # The provisioning parameters file (already uploaded to S3)
|-- validate-config.py # Python script that checks cluster configuration validity
The final prompt asks whether the script should create the cluster for you. The default is yes, so hit ENTER to let the script create your HyperPod cluster. If you answered no, create the cluster yourself with the following command:
aws sagemaker create-cluster \
--cli-input-json file://cluster-config.json \
--region "$AWS_REGION"
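The manual command depends on two things being in place: the `AWS_REGION` variable and the generated `cluster-config.json`. The sketch below checks both before you run it. It assumes that `env_vars` contains shell `export NAME=value` lines that can be sourced; that is an assumption about the file format, and the function name `preflight_check` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: verify the inputs that the manual create-cluster command relies on.
# Assumption: env_vars holds shell "export NAME=value" lines.
preflight_check() {
  local dir="${1:-.}"
  # Load the variables written by the script, if the file is present.
  if [ -f "$dir/env_vars" ]; then
    . "$dir/env_vars"
  fi
  # The --region flag needs a non-empty value.
  if [ -z "${AWS_REGION:-}" ]; then
    echo "AWS_REGION is not set; source env_vars or export it first" >&2
    return 1
  fi
  # The cluster configuration must exist and be non-empty.
  if [ ! -s "$dir/cluster-config.json" ]; then
    echo "cluster-config.json is missing or empty in $dir" >&2
    return 1
  fi
  echo "Ready: region=$AWS_REGION config=$dir/cluster-config.json"
}
```

From the `hyperpod` directory, `preflight_check .` prints a `Ready:` line when both inputs are present and returns a non-zero status otherwise, so it can be chained in front of the create command with `&&`.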
Monitor Cluster Creation Status
You can check the status of your cluster using the command below:
aws sagemaker list-clusters --output table
Cluster creation typically takes 12-15 minutes.
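Rather than re-running the list command by hand, you could poll the cluster until it is ready. The sketch below wraps `aws sagemaker describe-cluster` in a hypothetical `wait_for_cluster` function and reads the `ClusterStatus` field, which reports `Creating` while the cluster is being built and `InService` once it is ready. The polling interval, the poll limit, and the cluster name in the example call are placeholders, not values taken from the script.

```shell
#!/usr/bin/env bash
# Sketch: poll a HyperPod cluster until it leaves the Creating state.
# Arguments: cluster name, seconds between polls (default 60), max polls (default 30).
wait_for_cluster() {
  local name="$1" interval="${2:-60}" max_polls="${3:-30}" status i
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    # Ask SageMaker for the current status of this one cluster.
    status=$(aws sagemaker describe-cluster \
      --cluster-name "$name" \
      --query 'ClusterStatus' --output text) || return 1
    echo "Cluster $name: $status"
    case "$status" in
      InService) return 0 ;;           # ready to use
      Failed|RollingBack) return 1 ;;  # creation did not succeed
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "Timed out waiting for cluster $name" >&2
  return 2
}

# Example call (placeholder name; use the name you chose during setup):
# wait_for_cluster my-cluster 60 30
```

With the defaults, the loop checks once a minute for up to 30 minutes, which leaves headroom beyond the typical creation time.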