
Remove interruptions with a resilient development environment
HyperPod automatically detects, diagnoses, and recovers from infrastructure faults. Run model development workloads continuously for months without interruption through intelligent fault management and self-healing capabilities.

Efficiently scale and parallelize model training across thousands of AI accelerators
HyperPod automatically splits models and datasets across AWS cluster instances for efficient scaling. It optimizes training jobs for AWS network infrastructure and cluster topology, and it streamlines checkpointing by tuning checkpoint frequency to minimize training overhead.
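To make the ideas above concrete, here is a minimal sketch of two generic techniques: splitting a dataset across workers and picking a checkpoint interval. It is not HyperPod's implementation. The function names are hypothetical, and the interval uses Young's first-order approximation, which is assumed here only as a common textbook rule.

```python
import math


def shard_indices(num_samples: int, num_workers: int, rank: int) -> range:
    """Return the sample indices assigned to one worker (strided sharding).

    Hypothetical helper: worker `rank` takes every `num_workers`-th sample,
    so the shards are disjoint and together cover the whole dataset.
    """
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    return range(rank, num_samples, num_workers)


def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate the checkpoint interval (seconds) that balances
    checkpoint overhead against work lost to failures.

    Young's approximation: interval = sqrt(2 * checkpoint_cost * MTBF).
    This is an assumed rule of thumb, not a documented HyperPod setting.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    # Worker 1 of 4 over a 10-sample dataset.
    print(list(shard_indices(10, 4, 1)))
    # 50 s per checkpoint, one failure every 10,000 s on average.
    print(young_checkpoint_interval(50.0, 10_000.0))
```

The trade-off the second function captures is that frequent checkpoints waste time writing state, while sparse checkpoints lose more work when a fault occurs.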

Achieve state-of-the-art performance with recipes and tools
Pre-built recipes let you start training and fine-tuning generative AI models in minutes. Customize Amazon Nova foundation models for business-specific use cases while maintaining industry-leading performance. Built-in experimentation and observability tools help teams of all skill levels improve model performance.

Reduce costs with centralized governance over all model development tasks
HyperPod provides full visibility and control over compute resource allocation for training and inference tasks. It automatically manages task queues, prioritizing critical work to meet deadlines and budgets. Efficient resource utilization reduces model development costs by up to 40%.
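As a rough illustration of priority-based task queuing, the sketch below admits tasks in priority order while accelerators remain free. It is an assumption-laden toy, not the HyperPod task governance API: the `Task` class, the `schedule` function, and the admission policy are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Hypothetical task record; only `priority` takes part in ordering.
    priority: int                               # lower number = more urgent
    name: str = field(compare=False)
    accelerators: int = field(compare=False)    # accelerators requested


def schedule(tasks, capacity):
    """Admit tasks in priority order while free accelerators remain.

    Returns (running, waiting) as lists of task names. A task that does not
    fit is left waiting, and smaller lower-priority tasks may still be
    admitted after it (a simple policy assumed for this sketch).
    """
    heap = list(tasks)
    heapq.heapify(heap)
    running, waiting = [], []
    free = capacity
    while heap:
        task = heapq.heappop(heap)
        if task.accelerators <= free:
            free -= task.accelerators
            running.append(task.name)
        else:
            waiting.append(task.name)
    return running, waiting


if __name__ == "__main__":
    queue = [
        Task(2, "fine-tune", 4),
        Task(1, "prod-inference", 2),
        Task(3, "experiment", 8),
    ]
    print(schedule(queue, capacity=8))
```

With eight accelerators, the two most urgent tasks run and the large experiment waits until capacity frees up.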

Learn with Video Tutorials
Watch these tutorials to master Amazon SageMaker HyperPod
Accelerate FM pre-training on Amazon SageMaker HyperPod (Amazon EKS)
Amazon SageMaker HyperPod is purpose-built to reduce time to train foundation models (FMs) by up to 40% and scale across more than a thousand AI accelerators efficiently. In this video, learn about Amazon EKS support in SageMaker HyperPod to accelerate your FM training.
Learn more at: https://go.aws/3TUKZSs
Accelerate FM pre-training on Amazon SageMaker HyperPod (Slurm)
Amazon SageMaker HyperPod is purpose-built to reduce time to train foundation models (FMs) by up to 40% and scale across more than a thousand AI accelerators efficiently. In this video, dive into how to run distributed training on SageMaker HyperPod.
Learn more at: https://go.aws/3TUKZSs
Get started with Amazon SageMaker HyperPod flexible training plans
Amazon SageMaker HyperPod helps you scale and accelerate generative AI model development. In this video, you will learn how to use the flexible training plans feature to run efficient model training that aligns with your timelines and budgets.
Learn more about Amazon SageMaker HyperPod at: https://go.aws/3WwsBA3