AI on SageMaker HyperPod

Optimized blueprints for deploying high-performance clusters to train, fine-tune, and host models for inference on Amazon SageMaker HyperPod

Scaling seismic foundation models on AWS with SageMaker HyperPod

TGS achieved near-linear scaling for distributed training and expanded context windows using HyperPod, reducing training time from 6 months to 5 days while analyzing larger seismic volumes.

Read Article →
Accelerating AI model production at Hexagon with SageMaker HyperPod

Hexagon partnered with AWS to scale AI model production by pretraining state-of-the-art segmentation models using SageMaker HyperPod's model training infrastructure.

Read Article →
Checkpointless Training on Amazon SageMaker HyperPod

A new training approach that reduces traditional checkpointing needs through peer-to-peer state recovery, achieving significant improvements in recovery speed and training efficiency on large GPU clusters.

Read Article →
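The idea behind peer-to-peer state recovery can be sketched in a few lines. The code below is a toy illustration under assumed semantics, not the HyperPod implementation or API: each rank mirrors its training state to a partner rank, so a restarted rank restores from its partner's in-memory copy instead of reloading a checkpoint from storage.

```python
# Toy sketch of peer-to-peer state recovery (hypothetical names throughout).
# Each rank's state is replicated to a "buddy" rank; recovery copies the
# replica back instead of reading a persistent checkpoint.

def buddy(rank: int, world_size: int) -> int:
    """Pair each rank with the partner that holds a replica of its state."""
    return rank ^ 1 if world_size % 2 == 0 else (rank + 1) % world_size

class Rank:
    def __init__(self, rank: int):
        self.rank = rank
        self.state = {"step": 0, "weights": [0.0]}
        self.replica_of_buddy = None  # mirrored copy of a partner's state

def replicate(ranks):
    """Mirror every rank's state to its buddy (done alongside training)."""
    for r in ranks:
        ranks[buddy(r.rank, len(ranks))].replica_of_buddy = dict(r.state)

def recover(failed: int, ranks):
    """Rebuild a failed rank from its buddy's in-memory replica."""
    fresh = Rank(failed)
    fresh.state = dict(ranks[buddy(failed, len(ranks))].replica_of_buddy)
    ranks[failed] = fresh
    return fresh
```

Because recovery is a memory-to-memory copy between peers rather than a round-trip to object storage, a restarted worker can rejoin at the last completed step almost immediately.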
Adaptive Infrastructure with Elastic Training on SageMaker HyperPod

HyperPod supports elastic training capabilities, allowing ML workloads to automatically scale based on resource availability while optimizing GPU utilization and reducing operational costs.

Read Article →
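One way elastic training can keep results consistent as capacity changes is to hold the global batch size fixed and redistribute work per device. This is an illustrative sketch of that bookkeeping (not the HyperPod elastic-training API): when accelerators join or leave, the per-device batch and gradient-accumulation steps are recomputed.

```python
# Hypothetical rebalancing sketch: keep the global batch size constant
# across world-size changes so training dynamics stay comparable.

def rebalance(global_batch: int, micro_batch: int, n_devices: int):
    """Return (per_device_batch, grad_accum_steps) for the current world size."""
    if global_batch % n_devices != 0:
        raise ValueError("global batch must divide evenly across devices")
    per_device = global_batch // n_devices
    if per_device % micro_batch != 0:
        raise ValueError("per-device batch must be a multiple of the micro-batch")
    return per_device, per_device // micro_batch
```

For example, with a global batch of 512 and micro-batch of 4, scaling from 16 devices down to 8 doubles each device's share (32 → 64 samples) and its accumulation steps (8 → 16), while the effective batch the optimizer sees is unchanged.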
Speed up cluster procurement with SageMaker HyperPod training plans

Reserve accelerated compute capacity up to 8 weeks in advance with flexible scheduling options. Training plans help organizations access compute resources for LLM training more quickly.

Read Article →
SageMaker HyperPod and Anyscale for distributed computing

Integrate HyperPod with the Anyscale platform to address infrastructure challenges in large-scale AI development. The combined solution provides robust infrastructure for distributed AI workloads built on Ray.

Read Article →
Resilient development environment

Remove interruptions with a resilient development environment

Automatically detects, diagnoses, and recovers from infrastructure faults. Run model development workloads continuously for months without interruption through intelligent fault management and self-healing capabilities.
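The detect-diagnose-recover loop described above can be reduced to a simple pattern. This is a toy sketch with hypothetical names, not the HyperPod health-monitoring implementation: probe each node, swap out any that fail the check, and report what was replaced so the job can resume.

```python
# Toy self-healing loop (illustrative only): detect unhealthy nodes and
# replace them in place, leaving healthy nodes untouched.

def heal(nodes, is_healthy, provision):
    """Replace every node failing `is_healthy` with a freshly provisioned one.

    Returns the list of nodes that were replaced.
    """
    replaced = []
    for i, node in enumerate(nodes):
        if not is_healthy(node):
            nodes[i] = provision()  # e.g. pull a spare from a warm pool
            replaced.append(node)
    return replaced
```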

State-of-the-art performance

Efficiently scale and parallelize model training across thousands of AI accelerators

Automatically splits models and datasets across AWS cluster instances for efficient scaling. Optimizes training jobs for AWS network infrastructure and cluster topology. Streamlines checkpointing with optimized frequency to minimize training overhead.
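At its simplest, splitting a dataset across cluster instances means giving each rank a disjoint slice. The sketch below shows the contiguous-shard version of that idea; it is a minimal illustration, and the real HyperPod training libraries handle this partitioning (and the model-parallel equivalent) automatically.

```python
# Minimal dataset-sharding sketch: each rank owns one contiguous slice.

def shard(dataset, rank: int, world_size: int):
    """Return the contiguous slice of `dataset` owned by `rank`."""
    per_rank = -(-len(dataset) // world_size)  # ceiling division
    return dataset[rank * per_rank:(rank + 1) * per_rank]
```

With 10 samples over 4 ranks, ranks 0-2 each get 3 samples and rank 3 gets the remainder; the shards are disjoint and together cover the whole dataset.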

State-of-the-art performance

Achieve state-of-the-art performance with recipes and tools

Pre-built recipes enable rapid training and fine-tuning of generative AI models in minutes. Customize Amazon Nova foundation models for business-specific use cases while maintaining industry-leading performance. Built-in experimentation and observability tools help enhance model performance across all skill levels.

Centralized governance

Reduce costs with centralized governance over all model development tasks

Provides full visibility and control over compute resource allocation for training and inference tasks. Automatically manages task queues, prioritizing critical work to meet deadlines and budgets. Efficient resource utilization reduces model development costs by up to 40%.
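Prioritized task queuing of the kind described above follows a standard priority-queue pattern. The class below is an illustrative sketch (hypothetical names, not the HyperPod task-governance API): lower priority numbers run first, and ties are broken by submission order so equal-priority tasks are served fairly.

```python
import heapq

# Illustrative priority scheduler: critical work (lower priority number)
# is dispatched before lower-priority experiments.
class TaskQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker preserves submission order

    def submit(self, name: str, priority: int):
        heapq.heappush(self._heap, (priority, self._seq, name))
        self._seq += 1

    def next_task(self):
        """Pop the highest-priority task, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A production scheduler would add preemption, quotas, and budget accounting on top of this ordering, but the core dispatch decision is the same.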

Learn with Video Tutorials

Watch these tutorials to master Amazon SageMaker HyperPod

Accelerate FM pre-training on Amazon SageMaker HyperPod (Amazon EKS)
3:45

Amazon SageMaker HyperPod is purpose-built to reduce time to train foundation models (FMs) by up to 40% and scale across more than a thousand AI accelerators efficiently. In this video, learn about Amazon EKS support in SageMaker HyperPod to accelerate your FM training.
Learn more at: https://go.aws/3TUKZSs

Accelerate FM pre-training on Amazon SageMaker HyperPod (Slurm)
4:12

Amazon SageMaker HyperPod is purpose-built to reduce time to train foundation models (FMs) by up to 40% and scale across more than a thousand AI accelerators efficiently. In this video, dive into how to run distributed training on SageMaker HyperPod.
Learn more at: https://go.aws/3TUKZSs

Get started with Amazon SageMaker HyperPod flexible training plans
5:28

Amazon SageMaker HyperPod helps you scale and accelerate generative AI model development. In this video, you will learn how to use the flexible training plans feature to run efficient model training that aligns with your timelines and budgets.
Learn more about Amazon SageMaker HyperPod - https://go.aws/3WwsBA3