SageMaker HyperPod Monitoring with OS Grafana

Reference Documentation

For production observability setup, see Observability on Slurm.

For this workshop, we use an EC2 instance running an open-source (OS) Grafana container, along with an Amazon Managed Service for Prometheus workspace. These resources are deployed using the cluster-observability-os-grafana.yaml CloudFormation template.

Deploy the Observability Stack

Download and deploy the CloudFormation template:

curl -O https://raw.githubusercontent.com/awslabs/awsome-distributed-ai/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

aws cloudformation create-stack \
  --stack-name os-observability \
  --template-body file://cluster-observability-os-grafana.yaml \
  --capabilities CAPABILITY_IAM \
  --region us-west-2

Wait for the stack to reach CREATE_COMPLETE before continuing:

aws cloudformation wait stack-create-complete --stack-name os-observability --region us-west-2
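
If the wait command returns an error, or you want to confirm what was created, the stack can be inspected directly. The following is a minimal sketch, assuming the stack name and region used above. The helper names stack_status and stack_outputs are hypothetical, and the output keys depend on the template, so none are assumed here.

```shell
#!/usr/bin/env bash
# Stack name and region match the commands above; override them if you changed either.
STACK_NAME="${STACK_NAME:-os-observability}"
REGION="${REGION:-us-west-2}"

# Print the current stack status, for example CREATE_COMPLETE.
stack_status() {
  aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --region "$REGION" \
    --query 'Stacks[0].StackStatus' \
    --output text
}

# Print the stack outputs as a table. The keys are defined by the template.
stack_outputs() {
  aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --region "$REGION" \
    --query 'Stacks[0].Outputs' \
    --output table
}
```

After sourcing the snippet, run stack_status to check progress and stack_outputs once the stack reports CREATE_COMPLETE.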

Note: In a production HyperPod Slurm deployment, customers typically use Amazon Managed Grafana instead of OS Grafana. CloudFormation templates in the AWSome Distributed AI repository orchestrate the deployment of the monitoring resources.

Resources Deployed

Amazon Managed Service for Prometheus workspace
Amazon Managed Grafana workspace (or OS Grafana for workshops)
Associated IAM roles and permissions
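
To see which of these resources a given stack actually created, you can list its resources. This is a sketch under the assumption that the stack is named os-observability in us-west-2, as in the deployment commands. The helper name list_stack_resources is hypothetical, and the logical resource IDs it prints come from the template.

```shell
#!/usr/bin/env bash
# Stack name and region are assumed to match the deployment commands.
STACK_NAME="${STACK_NAME:-os-observability}"
REGION="${REGION:-us-west-2}"

# Show each resource's logical ID, type, and status as a table.
list_stack_resources() {
  aws cloudformation list-stack-resources \
    --stack-name "$STACK_NAME" \
    --region "$REGION" \
    --query 'StackResourceSummaries[].[LogicalResourceId,ResourceType,ResourceStatus]' \
    --output table
}
```

Calling list_stack_resources prints one row per resource, which makes it easy to confirm that the Prometheus workspace and IAM roles are present.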
