Keep control of your HPC cost on AWS
Scale-Out Computing on AWS offers several ways to keep your HPC workloads on AWS within budget.
Limit who can submit jobs
Only allow specific individual users and/or LDAP groups to submit jobs. Refer to this page for examples and documentation.
Limit what type of EC2 instance can be provisioned
Control which EC2 instance types can be provisioned for any given queue. Refer to this page for examples and documentation.
Accelerated Computing Instances
Unless your workloads require them, we recommend excluding "p2", "p3", "g2", "g3", "p3dn", and other GPU instance types.
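To illustrate the family-based exclusion described above, here is a minimal Python sketch. It is not the Scale-Out Computing on AWS implementation: the names `EXCLUDED_FAMILIES`, `instance_family`, and `is_instance_allowed` are hypothetical, and the only assumption about EC2 is that an instance type is written as `family.size` (for example `p3dn.24xlarge`).

```python
# Hypothetical helper for illustration; not part of Scale-Out Computing on AWS.
# GPU families named in the recommendation above.
EXCLUDED_FAMILIES = {"p2", "p3", "p3dn", "g2", "g3"}


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type, e.g. 'p3dn' for 'p3dn.24xlarge'."""
    return instance_type.split(".", 1)[0].lower()


def is_instance_allowed(instance_type: str, excluded=EXCLUDED_FAMILIES) -> bool:
    """Return True when the instance type's family is not in the excluded set."""
    return instance_family(instance_type) not in excluded


if __name__ == "__main__":
    for candidate in ["c5.large", "p3.2xlarge", "p3dn.24xlarge", "m5.xlarge"]:
        status = "allowed" if is_instance_allowed(candidate) else "excluded"
        print(f"{candidate}: {status}")
```

Matching on the whole family name rather than a prefix keeps "p3" and "p3dn" as separate entries, which mirrors how the recommendation lists them.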
Force jobs to run only on Reserved Instances
You can restrict a job to Reserved Instances by specifying the force_ri=True flag (see the documentation) at job submission or for the entire queue.
Your job will stay in the queue until a Reserved Instance is available.
Limit the number of concurrent jobs or provisioned instances
You can limit the number of concurrent running jobs or provisioned instances at the queue level. Edit queue_mapping.yml and set max_running_jobs and/or max_provisioned_instances to the limit you do not want to exceed.
    queue_type:
      compute:
        queues: ["myqueue"]
        max_running_jobs: 5
        max_provisioned_instances: 10
In this example, at most 5 jobs can run concurrently in "myqueue". Similarly, jobs cannot request more than 10 instances (note: you can also limit the type/family of instances your users can provision).
These settings are independent, so you can limit by number of jobs, by number of instances, by both, or by neither.
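The independence of the two limits can be sketched in Python as follows. This is a hedged illustration, not the scheduler's actual logic: the function `can_start_job` and its parameters are hypothetical, and it assumes the instance limit counts instances already provisioned for the queue plus those the new job requests, which is one possible reading of the text above. A limit left as `None` stands for a setting that is absent from queue_mapping.yml.

```python
from typing import Optional


def can_start_job(
    running_jobs: int,
    provisioned_instances: int,
    requested_instances: int,
    max_running_jobs: Optional[int] = None,
    max_provisioned_instances: Optional[int] = None,
) -> bool:
    """Return True when starting one more job respects every configured limit."""
    # A limit of None means "not configured", so that check is skipped.
    if max_running_jobs is not None and running_jobs >= max_running_jobs:
        return False
    if (
        max_provisioned_instances is not None
        and provisioned_instances + requested_instances > max_provisioned_instances
    ):
        return False
    return True


if __name__ == "__main__":
    # Values taken from the "myqueue" example: 5 jobs, 10 instances.
    print(can_start_job(4, 0, 10, max_running_jobs=5, max_provisioned_instances=10))
    print(can_start_job(5, 0, 1, max_running_jobs=5, max_provisioned_instances=10))
```

With the "myqueue" values, the first call passes both checks and the second is blocked by the job limit alone, regardless of how many instances it requests.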
Create a budget
Creating an AWS Budget ensures that jobs can't be submitted once spending for the team/queue/project exceeds the authorized amount. Refer to this page for examples and documentation.
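The budget gate described above amounts to a comparison between actual spend and the authorized amount. The Python sketch below shows that comparison only; `budget_allows_submission` is a hypothetical name, the figures are invented for illustration, and how Scale-Out Computing on AWS actually retrieves the spend from AWS Budgets is not shown here.

```python
from decimal import Decimal


def budget_allows_submission(actual_spend: Decimal, budget_limit: Decimal) -> bool:
    """Return False once actual spend has exceeded the authorized amount."""
    return actual_spend <= budget_limit


if __name__ == "__main__":
    limit = Decimal("1000.00")  # hypothetical monthly budget
    for spend in (Decimal("250.00"), Decimal("1000.00"), Decimal("1000.01")):
        verdict = "accept" if budget_allows_submission(spend, limit) else "reject"
        print(f"spend={spend} limit={limit}: {verdict}")
```

`Decimal` is used instead of `float` so that currency amounts compare exactly at the boundary.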
Review your HPC cost in a central dashboard
Stay on top of your AWS costs in real time. Quickly visualize your overall usage and find answers to your most common questions:
Who are my top users?
How much money did we spend for Project A?
How much storage did we use for Queue B?
Where is my money going (storage, compute ...)?
Assuming you are on-boarding a new team, here are our recommended best practices: