Amazon SageMaker HyperPod - Slurm Workshop

Comprehensive Reference Documentation

This workshop provides a guided, hands-on sandbox experience, comparable to a proof of concept (POC). For production deployment guidelines and best practices, see the Slurm Orchestration reference documentation.

Prerequisites

Before starting, make sure you have AWS CLI credentials configured and the required IAM permissions. See the Prerequisites section for setup instructions. Additional infrastructure deployment steps (VPC, observability stack) are called out in each section.

Amazon SageMaker HyperPod offers advanced training tools that help you accelerate scalable, reliable, and secure generative AI application development. In this workshop, you will train a large language model (LLM) on diverse, representative data and learn how to use the latest SageMaker model training tools to troubleshoot convergence issues and improve model performance.

What You Will Learn

In this workshop you will:

Deploy a SageMaker HyperPod Slurm cluster
Run a sample Picotron distributed training job on ml.g5.8xlarge instances (NVIDIA A10G GPUs)
Test cluster resiliency with error injection and auto-recovery
Set up monitoring with Prometheus and Grafana

Target Audience

This workshop is intended for:

ML Researchers/Scientists
ML Engineers
ML Infrastructure Admins
HPC Engineers

Workshop Duration

This workshop takes approximately 2 hours to complete.

Region

This workshop is intended to be run in the us-west-2 region.
