SageMaker HyperPod Monitoring with OS Grafana
Reference Documentation
For a production observability setup, see Observability on Slurm.
For this workshop, we use an EC2 instance running an open-source (OS) Grafana container together with an Amazon Managed Service for Prometheus workspace. Both are deployed by the cluster-observability-os-grafana.yaml CloudFormation template.
Deploy the Observability Stack
Download and deploy the CloudFormation template:
curl -O https://raw.githubusercontent.com/awslabs/awsome-distributed-ai/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml
aws cloudformation create-stack \
    --stack-name os-observability \
    --template-body file://cluster-observability-os-grafana.yaml \
    --capabilities CAPABILITY_IAM \
    --region us-west-2
Wait for the stack to reach CREATE_COMPLETE before continuing:
aws cloudformation wait stack-create-complete --stack-name os-observability --region us-west-2
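The waiter returns a non-zero exit status if the stack rolls back or the wait times out, but it does not say which resource failed. The following is a minimal sketch for checking the stack status and listing failed resources. The function names (get_stack_status, classify_stack_status, show_failed_resources) are hypothetical helpers introduced here for illustration; the stack name and region defaults are taken from the commands above.

```bash
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of the workshop.

# Print the current status of a stack (defaults match the commands above).
# Requires AWS credentials that allow cloudformation:DescribeStacks.
get_stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "${1:-os-observability}" \
    --region "${2:-us-west-2}" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Map a CloudFormation status string to a coarse next step.
# Anything other than "complete" or "in progress" is treated as a failed create.
classify_stack_status() {
  case "$1" in
    CREATE_COMPLETE)    echo "ready" ;;
    CREATE_IN_PROGRESS) echo "waiting" ;;
    *)                  echo "failed" ;;
  esac
}

# List resources that failed to create, with the reason CloudFormation recorded.
show_failed_resources() {
  aws cloudformation describe-stack-events \
    --stack-name "${1:-os-observability}" \
    --region "${2:-us-west-2}" \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output table
}
```

For example, classify_stack_status "$(get_stack_status)" prints ready once the stack has been created. If it prints failed, show_failed_resources lists the events worth investigating before deleting and redeploying the stack.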
Note: In a production HyperPod Slurm deployment, customers typically use Amazon Managed Grafana instead of OS Grafana. CloudFormation templates in the AWSome Distributed AI repository orchestrate the deployment of the monitoring resources.
Resources Deployed
- Amazon Managed Service for Prometheus workspace
- Amazon Managed Grafana workspace (or, for this workshop, an EC2 instance running OS Grafana)
- Associated IAM roles and permissions
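To confirm what the stack actually created, the resource list can be compared against the items above. This is a rough sketch under stated assumptions: the expected CloudFormation resource types (AWS::APS::Workspace, AWS::EC2::Instance, AWS::IAM::Role) are a guess at how the workshop template models these resources and should be adjusted after reading the template, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: EXPECTED_TYPES is an assumption about the template contents.
EXPECTED_TYPES="AWS::APS::Workspace AWS::EC2::Instance AWS::IAM::Role"

# Print one resource type per line for every resource in the stack.
# Requires AWS credentials that allow cloudformation:ListStackResources.
list_stack_resource_types() {
  aws cloudformation list-stack-resources \
    --stack-name "${1:-os-observability}" \
    --region "${2:-us-west-2}" \
    --query 'StackResourceSummaries[].ResourceType' \
    --output text | tr '\t' '\n'
}

# Read resource types on stdin and print each expected type that is absent.
missing_types() {
  local found t
  found="$(cat)"
  for t in $EXPECTED_TYPES; do
    printf '%s\n' "$found" | grep -qxF "$t" || echo "$t"
  done
}
```

Running list_stack_resource_types | missing_types prints nothing when every expected type is present, and otherwise prints the types that were not found in the stack.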