Introduction
💡 Optimized blueprints for deploying high-performance clusters to train, fine-tune, and host (inference) models on Amazon SageMaker HyperPod

Train, fine-tune, and host generative AI models on Amazon SageMaker HyperPod
Welcome to AI on SageMaker HyperPod, your home for deploying large distributed training clusters on Amazon SageMaker HyperPod.
What can you find here
This is the home for all things related to Amazon SageMaker HyperPod, built by the ML Frameworks team at AWS with support from the open source community. We strive to deliver content and assets based on real-world use cases and customer feedback.
Explore practical examples, architectural patterns, troubleshooting guides, and more. Work through running large distributed training jobs, fine-tuning, distillation, and preference alignment, using frameworks such as PyTorch, JAX, NeMo, and Ray. We provide examples for Meta's Llama, Amazon Nova, Mistral, DeepSeek, and others.
You will also find troubleshooting advice for specific problems you may encounter, best practices for integrating with other AWS services and open source projects, and code snippets that you can incorporate into your workloads.
Note: AI on SageMaker HyperPod is under active development. For upcoming features and enhancements, please check out the issues section.
Examples provided
These are the examples you can find in this project:
- Running a Fully Sharded Data Parallel training example on multiple GPUs
- Running a Distributed Data Parallel training example using CPU only
- Setting up Task Governance and Task Affinity for improved cluster governance and utilization
Getting Started
Before exploring the features and examples provided here, we suggest you work through the setup of your SageMaker HyperPod cluster. For that initial step, we provide examples of how to do it using different methods (GUI, CLI scripts, Infrastructure as Code (IaC), etc.). After deploying your cluster, we recommend running a few basic tests to validate that the cluster works as expected.
Then you can select which of the scenarios you want to work on. Every scenario offers two possible orchestration choices: Slurm or EKS. Select the specific example you want to go through and the orchestration engine you are using on your cluster.
Prerequisites
Before getting started with SageMaker HyperPod, configure your environment with the required tools.
Install the AWS CLI
The AWS CLI comes pre-installed on AWS CloudShell.
Linux_x86_64:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update
Linux_ARM64:
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
macOS:
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
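The three installers above differ only in the download URL, so a script that has to run on more than one platform can pick the URL from the output of `uname`. The following is a minimal sketch under that assumption; the function name `awscli_installer_url` is hypothetical and not part of the AWS CLI.

```shell
# Map an OS name and CPU architecture (as printed by `uname -s` and
# `uname -m`) to the AWS CLI v2 installer URL used in the commands above.
# The function name is illustrative only.
awscli_installer_url() {
  case "$1/$2" in
    Linux/x86_64)
      echo "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" ;;
    Linux/aarch64|Linux/arm64)
      echo "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" ;;
    Darwin/*)
      echo "https://awscli.amazonaws.com/AWSCLIV2.pkg" ;;
    *)
      # Anything else is not covered by the instructions above.
      echo "unsupported platform: $1/$2" >&2
      return 1 ;;
  esac
}
```

Calling `awscli_installer_url "$(uname -s)" "$(uname -m)"` prints the URL for the current machine; the install steps that follow the download are unchanged.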
Configure AWS Credentials
Please refer to this documentation to understand the different ways to acquire AWS Credentials to use AWS CLI.
For simplicity and to demonstrate the process of configuring AWS credentials for the CLI, we are going to use long-term access keys for designated IAM Users.
To maintain a proper security posture, we recommend using short-term credentials instead, for example by setting up AWS IAM Identity Center (formerly AWS SSO).
1. Acquire long-term AWS access credentials
Please visit this documentation to learn how to acquire these credentials from the AWS console.
2. Configure AWS CLI
Run aws configure and enter the credentials you acquired above to make them available in your terminal. See configure aws credentials for more details.
aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json
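Once `aws configure` has finished, it can help to confirm that the stored credentials are actually accepted. A minimal sketch, assuming the AWS CLI is on your `PATH`: `aws sts get-caller-identity` is a standard CLI command that requires no IAM permissions, while the wrapper function `check_aws_credentials` is a hypothetical helper added here for illustration.

```shell
# Check that the AWS CLI is installed and that the configured credentials
# are accepted by AWS. Returns non-zero when either check fails.
# The function name is illustrative only.
check_aws_credentials() {
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found on PATH" >&2
    return 127
  fi
  # Echoes back the account ID and ARN that the current credentials
  # resolve to; fails if the keys are invalid or expired.
  aws sts get-caller-identity --output json
}
```

Run `check_aws_credentials` after configuring the CLI; a JSON document containing your account ID indicates that the keys work.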
3. Set AWS Region
Next, set the AWS_REGION environment variable to the region where you intend to stand up your infrastructure. The AWS CLI also provides command line arguments, as documented here, to override certain variables.
export AWS_REGION=us-west-2
For more information on the aws configure CLI command, please refer to this CLI reference documentation.
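Because the region can come from several places, it is worth knowing which one wins. The sketch below models the precedence as we understand it from the AWS CLI v2 documentation: an explicit `--region` argument first, then `AWS_REGION`, then `AWS_DEFAULT_REGION`, then the region stored in the profile. The function `effective_region` is hypothetical and only imitates that order; it does not call the CLI, and the fallback argument stands in for the profile setting.

```shell
# Resolve a region in the order the AWS CLI v2 is documented to use.
#   $1: value that would be passed via --region (may be empty)
#   $2: fallback, standing in for the region stored by `aws configure`
# The function name is illustrative only.
effective_region() {
  if [ -n "$1" ]; then
    echo "$1"                      # explicit command line argument wins
  elif [ -n "${AWS_REGION:-}" ]; then
    echo "$AWS_REGION"             # then the AWS_REGION variable
  elif [ -n "${AWS_DEFAULT_REGION:-}" ]; then
    echo "$AWS_DEFAULT_REGION"     # then AWS_DEFAULT_REGION
  else
    echo "$2"                      # finally the configured default
  fi
}
```

With `AWS_REGION=us-west-2` exported as shown above, `effective_region "" us-east-1` prints `us-west-2`, while `effective_region eu-west-1 us-east-1` prints `eu-west-1`.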
Install kubectl (for EKS only)
You will use kubectl to interact with the EKS cluster Kubernetes API server. See the Kubernetes documentation for official installation steps.
Linux (x86_64):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
Linux (arm64):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
macOS (arm64):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | shasum -a 256 --check
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl
Then verify the installation:
kubectl version --client
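The download commands above differ only in the OS and architecture segments of the URL. As a sketch of how that could be derived instead of hard-coded, the hypothetical helper `kubectl_download_url` below builds the URL from a release version plus the output of `uname -s` and `uname -m`. The URL layout is taken from the commands above; the helper itself is an assumption and not part of kubectl.

```shell
# Build the kubectl download URL for a given release and platform.
#   $1: release version, for example the contents of stable.txt
#   $2: OS name as printed by `uname -s`
#   $3: CPU architecture as printed by `uname -m`
# The function name is illustrative only.
kubectl_download_url() {
  local os arch
  case "$2" in
    Linux)  os=linux ;;
    Darwin) os=darwin ;;
    *) echo "unsupported OS: $2" >&2; return 1 ;;
  esac
  case "$3" in
    x86_64|amd64)  arch=amd64 ;;
    aarch64|arm64) arch=arm64 ;;
    *) echo "unsupported architecture: $3" >&2; return 1 ;;
  esac
  echo "https://dl.k8s.io/release/$1/bin/$os/$arch/kubectl"
}
```

For the current machine, `kubectl_download_url "$(curl -L -s https://dl.k8s.io/release/stable.txt)" "$(uname -s)" "$(uname -m)"` prints the binary URL; appending `.sha256` gives the matching checksum file.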
Install Helm (for EKS only)
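This page does not list the Helm installation commands. As a hedged sketch, one common route is the installer script published in the Helm documentation. The script URL below is an assumption taken from those docs and may change between Helm releases, so confirm it before use; the helper names `install_helm` and `helm_is_installed` are hypothetical.

```shell
# Installer script URL as published in the Helm documentation
# (an assumption: verify it against the current Helm docs before use).
HELM_INSTALLER_URL="${HELM_INSTALLER_URL:-https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3}"

# Report whether a helm binary is already available on PATH.
helm_is_installed() {
  command -v helm >/dev/null 2>&1
}

# Download the installer script, make it executable, and run it.
# Each step runs only if the previous one succeeded.
install_helm() {
  curl -fsSL -o get_helm.sh "$HELM_INSTALLER_URL" &&
    chmod 700 get_helm.sh &&
    ./get_helm.sh
}
```

A typical invocation is `helm_is_installed || install_helm`, followed by `helm version` to confirm the result.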
Install eksctl (for EKS only)
You can use eksctl to create an IAM OIDC provider and install CSI drivers. See the eksctl documentation for alternative installation options.
# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
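The commands above ask you to set `ARCH` by hand. A minimal sketch of deriving the whole `PLATFORM` string from `uname` instead is shown below; the helper name `eksctl_platform` is hypothetical, and the mapping assumes the architecture names listed in the comment above (`amd64`, `arm64`, `armv6`, `armv7`).

```shell
# Derive the PLATFORM value used in the eksctl download URL.
#   $1: OS name as printed by `uname -s`
#   $2: CPU architecture as printed by `uname -m`
# The function name is illustrative only.
eksctl_platform() {
  local arch
  case "$2" in
    x86_64|amd64)  arch=amd64 ;;
    aarch64|arm64) arch=arm64 ;;
    armv6*)        arch=armv6 ;;
    armv7*)        arch=armv7 ;;
    *) echo "unsupported architecture: $2" >&2; return 1 ;;
  esac
  echo "${1}_${arch}"
}
```

Setting `PLATFORM=$(eksctl_platform "$(uname -s)" "$(uname -m)")` can then replace the `ARCH` and `PLATFORM` assignments, with the remaining download and install steps left unchanged.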
Install Terraform
If you plan to use Terraform to deploy your infrastructure, see the Terraform documentation for installation instructions.
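Before moving on, a quick check that every prerequisite is on your `PATH` can save time later. The sketch below uses only the standard `command -v` builtin; the helper name `missing_tools` is hypothetical, and the list of tools you pass depends on whether you use EKS or Terraform.

```shell
# Print the name of every tool from the argument list that cannot be
# found on PATH; prints nothing when all of them are available.
# The function name is illustrative only.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

For an EKS-based setup with Terraform, `missing_tools aws kubectl helm eksctl terraform` lists whatever still needs to be installed.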
Documentation
Amazon SageMaker HyperPod is part of the Amazon SageMaker AI family of AI-focused managed services on AWS. The official documentation focuses on helping customers set up their clusters and AWS accounts.
This repository strives to go further and help customers set up the additional software stack required to quickly conduct proofs of concept and build production-ready clusters.
Support & Feedback
AI on SageMaker HyperPod is maintained by the AWS ML Frameworks team and is not an AWS service. Support is provided on a best-effort basis by the AI on SageMaker HyperPod community. If you have feedback, feature ideas, or wish to report bugs, please use the Issues section of this GitHub repository.
Security
See CONTRIBUTING for more information.
License
This library is licensed under the Apache 2.0 License.
Community
We're building an open-source community focused on the development and inference of generative AI models with ML frameworks.
Come join us and contribute to shaping the future of AI on Amazon SageMaker HyperPod.
Built with ❤️ at AWS.