
Receive cluster status/health events

Integration with Amazon EventBridge

SageMaker HyperPod delivers two types of notifications through Amazon EventBridge:

  1. Cluster status change events
  2. Node health events
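As a rough sketch, the two event types above could be selected with an EventBridge event pattern like the one below. The `source` and `detail-type` strings are assumptions, not taken from this guide; confirm them against the events your account actually receives. The `matches` helper is a hypothetical, heavily simplified local check and does not implement the full EventBridge matching rules.

```python
import json

# Assumed values: verify the source and detail-type strings in your own account.
HYPERPOD_EVENT_PATTERN = {
    "source": ["aws.sagemaker"],
    "detail-type": [
        "SageMaker HyperPod Cluster State Change",
        "SageMaker HyperPod Cluster Node Health Event",
    ],
}


def matches(pattern, event):
    """Simplified check: each pattern key must list the event's value for that key."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


if __name__ == "__main__":
    # Print the pattern as JSON, the form an EventBridge rule expects.
    print(json.dumps(HYPERPOD_EVENT_PATTERN, indent=2))
```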

This guide provides instructions for setting up human-readable email notifications for these events using AWS Lambda and Amazon Simple Email Service (SES).

Set up email notifications

1. Decide on email addresses

First, determine the sender's and the receiver's email addresses.

2. Create and verify email identities in SES

In the Amazon SES management console, use the "Create identity" button to create an email identity for both the sender address and the receiver address.
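If you prefer scripting over the console, the same identities could be requested with boto3. This is a minimal sketch, assuming the classic SES API (`verify_email_identity`) and placeholder addresses; `request_identities` is a hypothetical helper name.

```python
def request_identities(ses_client, addresses):
    """Ask SES to start verification for each address (SES emails a confirmation link)."""
    requested = []
    for address in addresses:
        ses_client.verify_email_identity(EmailAddress=address)
        requested.append(address)
    return requested


# Example usage (requires boto3 and AWS credentials; addresses are placeholders):
#   import boto3
#   ses = boto3.client("ses")
#   request_identities(ses, ["sender@example.com", "receiver@example.com"])
```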

SES Console

SES sends a confirmation email to each address. Click the link in that email to verify the address, and make sure its "Identity status" changes to "Verified".
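The verification status can also be checked programmatically. The sketch below assumes the classic SES API, where a verified identity reports a `VerificationStatus` of `Success`; `unverified_identities` is a hypothetical helper name.

```python
def unverified_identities(ses_client, addresses):
    """Return the addresses that SES does not yet report as verified."""
    response = ses_client.get_identity_verification_attributes(
        Identities=list(addresses)
    )
    attributes = response.get("VerificationAttributes", {})
    return [
        address
        for address in addresses
        if attributes.get(address, {}).get("VerificationStatus") != "Success"
    ]


# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   pending = unverified_identities(boto3.client("ses"), ["sender@example.com"])
```

An empty result means every address is ready to use.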

3. Deploy the CloudFormation template

Click the button below to deploy the CloudFormation stack, which creates the EventBridge rule, the Lambda function, and the required IAM roles.
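To give a sense of what such a Lambda function does, here is a minimal sketch of a handler that turns an EventBridge event into a plain-text email. It is not the function the stack deploys. The environment variable names `SENDER_ADDRESS` and `RECEIVER_ADDRESS` and the optional `ses_client` parameter are assumptions made for illustration. Only the standard EventBridge envelope fields are read; the `detail` payload is dumped as-is because its schema is not described here.

```python
import json
import os


def format_email(event):
    """Build a (subject, body) pair from the standard EventBridge envelope fields."""
    detail_type = event.get("detail-type", "Unknown event")
    subject = f"[HyperPod] {detail_type}"
    lines = [
        f"Event type: {detail_type}",
        f"Time: {event.get('time', 'n/a')}",
        f"Region: {event.get('region', 'n/a')}",
        f"Account: {event.get('account', 'n/a')}",
        "",
        "Detail:",
        json.dumps(event.get("detail", {}), indent=2, sort_keys=True),
    ]
    return subject, "\n".join(lines)


def lambda_handler(event, context, ses_client=None):
    """Send the formatted event by email; ses_client is injectable for local testing."""
    if ses_client is None:
        import boto3  # imported lazily so format_email can be tested without boto3

        ses_client = boto3.client("ses")
    subject, body = format_email(event)
    ses_client.send_email(
        Source=os.environ["SENDER_ADDRESS"],  # hypothetical variable name
        Destination={"ToAddresses": [os.environ["RECEIVER_ADDRESS"]]},  # hypothetical
        Message={"Subject": {"Data": subject}, "Body": {"Text": {"Data": body}}},
    )
    return {"subject": subject}
```

Keeping the formatting in its own function makes it easy to adjust the email layout and test it locally with a sample event.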

Deploy EventBridge Email Stack

4. Verify

Verify that you receive notification emails by changing the cluster status (e.g., scaling up or down). You can also test node health notifications by triggering a manual instance replacement.

Node Health Email

Troubleshooting

If you don't receive the email, first check whether it was classified as spam. Then review the monitoring graphs in the EventBridge console and the Lambda console for errors. If the Lambda function is failing, check its CloudWatch Logs for the cause of the failure.
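The log check can be scripted as well. The sketch below assumes the Lambda function writes to the default log group `/aws/lambda/<function-name>` and that failures contain the term `ERROR`; the function name in the usage comment is a placeholder, and `recent_error_messages` is a hypothetical helper.

```python
import time


def recent_error_messages(logs_client, function_name, minutes=60):
    """Return log messages containing ERROR from the last `minutes` minutes.

    Only the first page of results is read; follow nextToken for more.
    """
    start_ms = int((time.time() - minutes * 60) * 1000)
    response = logs_client.filter_log_events(
        logGroupName=f"/aws/lambda/{function_name}",
        filterPattern="ERROR",
        startTime=start_ms,
    )
    return [entry["message"].rstrip() for entry in response.get("events", [])]


# Example usage (requires boto3 and AWS credentials; the name is a placeholder):
#   import boto3
#   for line in recent_error_messages(boto3.client("logs"), "my-email-function"):
#       print(line)
```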

Next steps

  • You can customize the format of the emails by modifying the Lambda function. You can extract more information from the event JSON data, or even call SageMaker AI service APIs to add further details.
  • If you are using Amazon SES for the first time, your SES account is most likely in sandbox mode, which imposes some restrictions. See this document to learn more about the sandbox and how to move out of it and into production.
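Both next steps can be explored with a few lines of boto3. This is a sketch under assumptions: `in_ses_sandbox` reads the `ProductionAccessEnabled` flag from the SES v2 `get_account` call, and `cluster_status` shows one way a customized Lambda function might enrich an email by calling the SageMaker `describe_cluster` API. Both helper names are hypothetical.

```python
def in_ses_sandbox(sesv2_client):
    """Return True when the SES account has not been granted production access."""
    account = sesv2_client.get_account()
    return not account.get("ProductionAccessEnabled", False)


def cluster_status(sagemaker_client, cluster_name):
    """Look up the current status of a HyperPod cluster by name."""
    response = sagemaker_client.describe_cluster(ClusterName=cluster_name)
    return response["ClusterStatus"]


# Example usage (requires boto3 and AWS credentials; the cluster name is a placeholder):
#   import boto3
#   print(in_ses_sandbox(boto3.client("sesv2")))
#   print(cluster_status(boto3.client("sagemaker"), "my-cluster"))
```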