
AI on SageMaker HyperPod

Optimized blueprints for deploying high-performance clusters to train, fine-tune, and host models for inference on Amazon SageMaker HyperPod

Amazon SageMaker HyperPod now supports custom AMIs

Deploy clusters with pre-configured, security-hardened environments that meet organizational requirements. Custom AMIs enable faster startup times and consistent configurations across cluster nodes.

Announcing Managed Tiered Checkpointing for Amazon SageMaker HyperPod (Training Best Practices)

Train reliably on large-scale clusters with configurable checkpoint frequency across in-memory and persistent storage. The feature integrates with PyTorch's Distributed Checkpoint for easy implementation.

Amazon SageMaker HyperPod now supports autoscaling using Karpenter

Automatically scale clusters to meet dynamic inference and training demands. Managed node autoscaling eliminates Karpenter setup overhead while providing integrated resilience and fault tolerance.

Amazon SageMaker AI now supports P6e-GB200 UltraServers

Deliver 20x the compute and 11x the memory, with 360 petaflops of FP8 compute and 13.4 TB of HBM3e memory, combined with SageMaker's managed infrastructure and monitoring capabilities.

Resilient development environment

Remove interruptions with a resilient development environment

Automatically detects, diagnoses, and recovers from infrastructure faults. Run model development workloads continuously for months without interruption through intelligent fault management and self-healing capabilities.

State-of-the-art performance

Efficiently scale and parallelize model training across thousands of AI accelerators

Automatically splits models and datasets across AWS cluster instances for efficient scaling. Optimizes training jobs for AWS network infrastructure and cluster topology. Streamlines checkpointing with optimized frequency to minimize training overhead.

State-of-the-art performance

Achieve state-of-the-art performance with recipes and tools

Pre-built recipes let you start training and fine-tuning generative AI models in minutes. Customize Amazon Nova foundation models for business-specific use cases while maintaining industry-leading performance. Built-in experimentation and observability tools help users of all skill levels improve model performance.

State-of-the-art performance

Reduce costs with centralized governance over all model development tasks

Provides full visibility and control over compute resource allocation for training and inference tasks. Automatically manages task queues, prioritizing critical work to meet deadlines and budgets. Efficient resource utilization reduces model development costs by up to 40%.

Learn with Video Tutorials

Watch these tutorials to master Amazon SageMaker HyperPod

Accelerate FM pre-training on Amazon SageMaker HyperPod (Amazon EKS)

Amazon SageMaker HyperPod is purpose-built to reduce time to train foundation models (FMs) by up to 40% and scale across more than a thousand AI accelerators efficiently. In this video, learn about Amazon EKS support in SageMaker HyperPod to accelerate your FM training.
Learn more at: https://go.aws/3TUKZSs

Accelerate FM pre-training on Amazon SageMaker HyperPod (Slurm)

Amazon SageMaker HyperPod is purpose-built to reduce time to train foundation models (FMs) by up to 40% and scale across more than a thousand AI accelerators efficiently. In this video, dive into how to run distributed training on SageMaker HyperPod.
Learn more at: https://go.aws/3TUKZSs

Get started with Amazon SageMaker HyperPod flexible training plans

Amazon SageMaker HyperPod helps you scale and accelerate generative AI model development. In this video, you will learn how to use the flexible training plans feature to run efficient model training that aligns with your timelines and budgets.
Learn more about Amazon SageMaker HyperPod at: https://go.aws/3WwsBA3