Amazon SageMaker HyperPod - Slurm Workshop
This workshop provides a guided, hands-on sandbox experience, comparable to a proof of concept (POC). For production deployment guidelines and best practices, see the Slurm Orchestration reference documentation.
Before starting, make sure you have AWS CLI credentials configured and the required IAM permissions. See the Prerequisites section for setup instructions. Additional infrastructure deployment steps (VPC, observability stack) are called out in each section.
Amazon SageMaker HyperPod offers advanced training tools to help you accelerate scalable, reliable, and secure generative AI application development. In this workshop, you will train a large language model (LLM) on diverse, representative data and learn how to use the latest SageMaker model training tools to troubleshoot convergence issues and improve model performance.
What You Will Learn
In this workshop you will:
- Deploy a SageMaker HyperPod Slurm cluster
- Run a sample Picotron distributed training job on ml.g5.8xlarge A10 GPU-based instances
- Test cluster resiliency with error injection and auto-recovery
- Set up monitoring with Prometheus and Grafana
Target Audience
The intended audience for this workshop is:
- ML Researchers/Scientists
- ML Engineers
- ML Infrastructure Admins
- HPC Engineers
Workshop Duration
This workshop takes approximately 2 hours to complete.
Region
This workshop is intended to be run in the us-west-2 region.