Introduction

💡 Optimized blueprints for deploying high-performance clusters to train, fine-tune, and host (inference) models on Amazon SageMaker HyperPod

Train, fine-tune, and host generative AI models on Amazon SageMaker HyperPod

Welcome to AI on SageMaker HyperPod, your home for deploying large distributed training clusters on Amazon SageMaker HyperPod.

What can you find here

This is the home for all things related to Amazon SageMaker HyperPod, built by the ML Frameworks team at AWS with support from the open source community. We strive to deliver content and assets based on real-world use cases and customer feedback.

Explore practical examples, architectural patterns, troubleshooting guidance, and other content. Work through running large distributed training jobs, fine-tuning, distillation, and preference alignment using frameworks such as PyTorch, JAX, NeMo, and Ray. We provide examples for Meta's Llama, Amazon Nova, Mistral, DeepSeek, and other models.

You will also find troubleshooting advice for specific problems you may encounter, best practices for integrating with other AWS services and open source projects, and code snippets that you may find useful to incorporate into your workloads.

Note: AI on SageMaker HyperPod is under active development. For upcoming features and enhancements, please check the issues section.

Examples provided

These are the examples you can find in this project:

Getting Started

Before exploring the features and examples provided here, we suggest you work through the setup of your SageMaker HyperPod cluster. For that initial step, we provide examples using different methods (GUI, CLI scripts, Infrastructure as Code (IaC), etc.). After deploying your cluster, we recommend running a few basic tests to validate that it is working as expected.

Then select the scenario you want to work on. Every scenario offers two orchestration choices: SLURM or EKS. Select the example you want to go through and the orchestration engine your cluster uses.

Prerequisites

Before getting started with SageMaker HyperPod, configure your environment with the required tools.

Install the AWS CLI

Note: The AWS CLI comes pre-installed on AWS CloudShell.

Linux (x86_64):

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update

Linux (arm64):

curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

macOS:

curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
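
If you script the installation, the choice between the three installers above can be made from the output of uname. The following is a minimal sketch, not part of the official installer; the function name awscli_installer_url is hypothetical, and the URLs are the ones listed above.

```shell
# Print the AWS CLI v2 installer URL for a given OS and CPU architecture.
# Both arguments are optional and default to what uname reports locally.
# NOTE: awscli_installer_url is an illustrative helper, not an AWS tool.
awscli_installer_url() {
  os="${1:-$(uname -s)}"
  arch="${2:-$(uname -m)}"
  case "$os/$arch" in
    Linux/x86_64)
      echo "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" ;;
    Linux/aarch64|Linux/arm64)
      echo "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" ;;
    Darwin/*)
      echo "https://awscli.amazonaws.com/AWSCLIV2.pkg" ;;
    *)
      echo "unsupported platform: $os/$arch" >&2
      return 1 ;;
  esac
}
```

Whichever installer you use, running `aws --version` afterwards should print a version string that starts with `aws-cli/2`.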

Configure AWS Credentials

Please refer to this documentation to understand the different ways to acquire AWS Credentials to use AWS CLI.

For simplicity and to demonstrate the process of configuring AWS credentials for the CLI, we are going to use long-term access keys for designated IAM Users.

Important: To maintain a proper security posture, we recommend using short-term credentials, for example by setting up AWS IAM Identity Center (formerly AWS SSO).

1. Acquire long-term AWS access keys

Please visit this documentation to learn how to acquire these credentials from the AWS console.

2. Configure AWS CLI

Using the credentials you fetched above, run aws configure to store them in your AWS CLI configuration. See configure aws credentials for more details.

aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-west-2
Default output format [None]: json

3. Set AWS Region

Next, set the AWS_REGION environment variable so that it points to the Region where you intend to deploy your infrastructure. The AWS CLI also provides command line options, documented here, to override certain variables.

export AWS_REGION=us-west-2

For more information on the aws configure command, please refer to this CLI reference documentation.
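
Before moving on, it can help to confirm which Region your shell will actually hand to the AWS CLI. The sketch below looks only at environment variables and assumes the documented AWS CLI v2 precedence, where AWS_REGION overrides AWS_DEFAULT_REGION; it does not read the profile written by aws configure. The function name resolve_aws_region is hypothetical.

```shell
# Print the Region selected by environment variables alone.
# AWS_REGION wins over AWS_DEFAULT_REGION; fail if neither is set.
# NOTE: resolve_aws_region is an illustrative helper, not an AWS CLI command.
resolve_aws_region() {
  if [ -n "${AWS_REGION:-}" ]; then
    echo "$AWS_REGION"
  elif [ -n "${AWS_DEFAULT_REGION:-}" ]; then
    echo "$AWS_DEFAULT_REGION"
  else
    echo "no Region set: export AWS_REGION first" >&2
    return 1
  fi
}
```

With credentials and a Region in place, `aws sts get-caller-identity --region "$(resolve_aws_region)"` returns the account and identity the CLI is using, which is a quick way to confirm that the configuration works.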

Install kubectl (for EKS only)

You will use kubectl to interact with the EKS cluster Kubernetes API server. See the Kubernetes documentation for official installation steps.

Linux (x86_64):

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

Linux (arm64):

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/arm64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

macOS (arm64):

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/arm64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | shasum -a 256 --check
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
sudo chown root: /usr/local/bin/kubectl

Verify the installation:

kubectl version --client
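
The checksum step in the commands above uses sha256sum on Linux and shasum on macOS. If you want one script that works on both, a small wrapper can hide the difference. This is a sketch under that assumption; verify_sha256 is a hypothetical helper name, not part of kubectl or the Kubernetes tooling.

```shell
# Succeed only if FILE matches the EXPECTED SHA-256 hex digest.
# Prefers sha256sum (common on Linux) and falls back to shasum -a 256
# (shipped with macOS).
# NOTE: verify_sha256 is an illustrative helper.
verify_sha256() {
  file="$1"
  expected="$2"
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$file" | cut -d' ' -f1)
  else
    actual=$(shasum -a 256 "$file" | cut -d' ' -f1)
  fi
  [ "$actual" = "$expected" ]
}
```

For the download above, you would call it as `verify_sha256 kubectl "$(cat kubectl.sha256)"` and install the binary only if the call succeeds.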

Install Helm (for EKS only)

Helm is a package manager for Kubernetes that will be used to install various dependencies using Charts.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

Install eksctl (for EKS only)

You can use eksctl to create an IAM OIDC provider and install CSI drivers. See the eksctl documentation for alternative installation options.

# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
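
The snippet above hard-codes ARCH=amd64 and leaves ARM users to change it by hand. As a sketch, the value can instead be derived from uname -m. The mapping below is an assumption based on common machine names (x86_64, aarch64, armv6l, armv7l), and eksctl_arch is a hypothetical helper name.

```shell
# Map a machine name (as printed by `uname -m`) to the ARCH value used in
# the eksctl download commands: amd64, arm64, armv6, or armv7.
# NOTE: eksctl_arch is an illustrative helper, not part of eksctl.
eksctl_arch() {
  machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    armv6*)        echo "armv6" ;;
    armv7*)        echo "armv7" ;;
    *)
      echo "unsupported architecture: $machine" >&2
      return 1 ;;
  esac
}
```

With this in place, `ARCH=$(eksctl_arch)` can replace the hard-coded assignment in the install commands.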

Install Terraform

If you plan to use Terraform to deploy your infrastructure, see the Terraform documentation for installation instructions.
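
Once the tools above are installed, a quick preflight check can confirm that everything your orchestrator needs is on your PATH. The sketch below assumes that SLURM-based clusters need only the AWS CLI from this list, while EKS-based clusters also need kubectl, helm, and eksctl, as the headings above indicate. Both function names are hypothetical.

```shell
# List the command-line tools from this guide for a given orchestrator.
# ASSUMPTION: SLURM needs only the AWS CLI; EKS adds kubectl, helm, eksctl.
required_tools() {
  case "$1" in
    eks)   echo "aws kubectl helm eksctl" ;;
    slurm) echo "aws" ;;
    *)
      echo "unknown orchestrator: $1 (expected eks or slurm)" >&2
      return 1 ;;
  esac
}

# Print each named tool that is not on PATH, one per line.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For an EKS cluster, `missing_tools $(required_tools eks)` prints nothing and succeeds when all four tools are installed. If you deploy with Terraform, add terraform to the argument list.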

Documentation

Amazon SageMaker HyperPod is part of the Amazon SageMaker AI family of AI-focused managed services on AWS. The service documentation focuses on helping customers set up their clusters and AWS accounts.

This repository strives to go further and help customers set up the additional software stack required to quickly conduct proofs of concept and build production-ready clusters.

Support & Feedback

AI on SageMaker HyperPod is maintained by the AWS ML Frameworks team and is not an AWS service. Support is provided on a best-effort basis by the AI on SageMaker HyperPod community. If you have feedback, feature ideas, or wish to report bugs, please use the Issues section of this GitHub repository.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the Apache 2.0 License.

Community

We're building an open-source community focused on developing and serving generative AI models with ML frameworks.

Come join us and contribute to shaping the future of AI on Amazon SageMaker HyperPod.

Built with ❤️ at AWS.
