What is Scale-Out Computing on AWS?
Scale-Out Computing on AWS is a solution that helps customers more easily deploy and operate a multiuser environment for computationally intensive workflows. The solution features a large selection of compute resources, a fast network backbone, unlimited storage, and budget and cost management directly integrated within AWS. The solution also deploys a user interface (UI) and automation tools that allow you to create your own queues, scheduler resources, Amazon Machine Images (AMIs), software, and libraries. This solution is designed to provide a production-ready reference implementation that serves as a starting point for deploying an AWS environment to run scale-out workloads, allowing you to focus on running simulations designed to solve complex computational problems.
Installation of your Scale-Out Computing on AWS cluster is fully automated and managed by AWS CloudFormation.
Did you know?
- You can have multiple Scale-Out Computing on AWS clusters on the same AWS account
- Scale-Out Computing on AWS comes with a list of unique tags, making resource tracking easy for AWS Administrators
Access your cluster in 1 click
Simple Job Submission
Scale-Out Computing on AWS supports a list of parameters designed to simplify your job submission on AWS. Advanced users can either manually choose the compute, storage, and network configuration for their job, or simply ignore these parameters and let Scale-Out Computing on AWS pick the most suitable hardware (as defined by the HPC administrator).
# Advanced Configuration
user@host$ qsub -l instance_type=c5n.18xlarge \
    -l instance_ami=ami-123abcde \
    -l nodes=2 \
    -l scratch_size=300 \
    -l efa_support=true \
    -l spot_price=1.55 myscript.sh

# Basic Configuration
user@host$ qsub myscript.sh
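To make the relationship between the two forms concrete, here is a minimal Python sketch that assembles a `qsub` command line from a dictionary of resource parameters. The helper name `build_qsub_command` is hypothetical and is not part of Scale-Out Computing on AWS; the parameter names are the ones shown in the example above. An empty dictionary yields the basic form.

```python
import shlex


def build_qsub_command(script, resources=None):
    """Return the argv list for a qsub call (hypothetical helper, for illustration)."""
    argv = ["qsub"]
    # Each resource becomes its own "-l key=value" pair, as in the example above.
    for key, value in (resources or {}).items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv += ["-l", f"{key}={value}"]
    argv.append(script)
    return argv


advanced = build_qsub_command(
    "myscript.sh",
    {
        "instance_type": "c5n.18xlarge",
        "nodes": 2,
        "scratch_size": 300,
        "efa_support": True,
        "spot_price": 1.55,
    },
)
basic = build_qsub_command("myscript.sh")

# Print shell-quoted command lines that could be pasted into a terminal.
print(shlex.join(advanced))
print(shlex.join(basic))
```

Returning an argument list rather than a single string keeps the result safe to pass to `subprocess.run` without shell quoting issues.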
OS agnostic and support for custom AMI
Customers can integrate their CentOS 7, RHEL 7, or Amazon Linux 2 AMI automatically by using -l instance_ami=<ami_id> at job submission. There is no limit on the number of AMIs (you can have 10 jobs running simultaneously using 10 different AMIs). SOCA supports heterogeneous environments, so you can have concurrent jobs running different operating systems on the same cluster.
AMI using a different OS than the scheduler
If your AMI uses a different operating system than your scheduler host, you can specify the OS manually to ensure packages are installed based on the node distribution.
In this example, we assume your Scale-Out Computing on AWS deployment was done using Amazon Linux 2, but you want to submit a job on your personal RHEL 7 AMI:
user@host$ qsub -l instance_ami=<ami_id> -l base_os=rhel7 myscript.sh
Scale-Out Computing on AWS AMI requirements
When you use a custom AMI, make sure that it does not use the /apps, /scratch, or /data partitions, because Scale-Out Computing on AWS needs these locations during deployment. Read this page for AMI creation best practices.
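One way to check an AMI before using it is to list the mount points of an instance launched from it and compare them with the reserved paths. The sketch below assumes you collect the mount points yourself (for example from `/proc/mounts` on Linux). The function names are hypothetical, and treating nested paths as conflicts is a conservative assumption rather than a documented rule.

```python
RESERVED_PATHS = ("/apps", "/scratch", "/data")


def find_reserved_conflicts(mount_points):
    """Return mount points that are equal to or nested under a reserved path."""
    conflicts = []
    for mount in mount_points:
        normalized = mount.rstrip("/") or "/"
        for reserved in RESERVED_PATHS:
            # "/database" is not a conflict: only exact matches or sub-paths count.
            if normalized == reserved or normalized.startswith(reserved + "/"):
                conflicts.append(mount)
                break
    return conflicts


def read_mount_points(path="/proc/mounts"):
    """Read mount points from a Linux mounts table (second whitespace-separated field)."""
    with open(path) as handle:
        fields = (line.split() for line in handle)
        return [parts[1] for parts in fields if len(parts) > 1]


example_mounts = ["/", "/boot", "/data", "/scratch/tmp", "/database"]
print(find_reserved_conflicts(example_mounts))
```

On the instance itself, `find_reserved_conflicts(read_mount_points())` returning an empty list suggests the AMI leaves the reserved locations free.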
Web User Interface
Scale-Out Computing on AWS includes a simple web UI designed to simplify user interactions such as:
- Start/Stop DCV sessions in 1 click
- Download your private key in either PEM or PPK format
- Check the queue and job status in real-time
- Add/Remove LDAP users
- Access the analytic dashboard
- Access your filesystem
- Understand why your jobs are stuck in the queue
- Create application profiles and let your users submit jobs directly via the web interface
HTTP REST API
Users can submit, retrieve, and delete jobs remotely via an HTTP REST API.
Budgets and Cost Management
You can review your HPC costs filtered by user/team/project/queue very easily using AWS Cost Explorer.
Scale-Out Computing on AWS also supports AWS Budgets and lets you create budgets assigned to a user, team, project, or queue. To prevent overspend, Scale-Out Computing on AWS includes hooks to restrict job submission when a customer-defined budget has expired.
Lastly, Scale-Out Computing on AWS lets you create queue ACLs or instance restrictions at the queue level. Refer to this link for best practices on controlling your HPC cost on AWS and preventing overspend.
Detailed Cluster Analytics
Scale-Out Computing on AWS includes OpenSearch (formerly Elasticsearch) and automatically ingests job and host data in real time for accurate visualization of your cluster activity.
Don't know where to start?
Scale-Out Computing on AWS includes dashboard examples if you are not familiar with OpenSearch (formerly Elasticsearch) or Kibana.
Scale-Out Computing on AWS is built entirely on top of AWS and can be customized by users as needed. Most of the logic is based on CloudFormation templates, shell scripts, and Python code. More importantly, the entire Scale-Out Computing on AWS codebase is open source and available on GitHub.
Persistent and Unlimited Storage
Scale-Out Computing on AWS includes two unlimited EFS file systems (/apps and /data). Customers can also deploy high-speed SSD EBS disks or FSx for Lustre as a scratch location on their compute nodes. Refer to this page to learn more about the various storage options offered by Scale-Out Computing on AWS.
Customers can create unlimited LDAP users and groups. By default, Scale-Out Computing on AWS includes an LDAP account provisioned during installation, as well as a "Sudoers" LDAP group that manages sudo permissions on the cluster.
Scale-Out Computing on AWS automatically backs up your data with no additional effort required on your side.
Support for network licenses
Scale-Out Computing on AWS includes a FlexLM-enabled script that calculates the number of licenses available for a given feature, and only starts the job and provisions the capacity when enough licenses are available.
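The license check can be illustrated with a minimal Python sketch. It assumes the summary-line format commonly printed by FlexLM's `lmstat` (shown in the comment) and uses hypothetical function and feature names; the script shipped with Scale-Out Computing on AWS may work differently.

```python
import re

# Assumed format of an lmstat summary line, for example:
# Users of feature_a:  (Total of 10 licenses issued;  Total of 7 licenses in use)
USAGE_LINE = re.compile(
    r"Users of (?P<feature>\S+):\s+\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<in_use>\d+) licenses? in use\)"
)


def available_licenses(lmstat_output, feature):
    """Return issued minus in-use licenses for a feature, or 0 if it is not listed."""
    for match in USAGE_LINE.finditer(lmstat_output):
        if match.group("feature") == feature:
            return int(match.group("issued")) - int(match.group("in_use"))
    return 0


def can_start_job(lmstat_output, feature, required):
    """True when enough licenses are free to start the job."""
    return available_licenses(lmstat_output, feature) >= required


sample = (
    "Users of feature_a:  (Total of 10 licenses issued;  Total of 7 licenses in use)\n"
    "Users of feature_b:  (Total of 1 license issued;  Total of 0 licenses in use)\n"
)
print(can_start_job(sample, "feature_a", 3))  # 3 free, 3 required
print(can_start_job(sample, "feature_a", 4))  # 3 free, 4 required
```

Returning 0 for an unknown feature keeps the job queued instead of provisioning capacity that would then sit idle waiting for a license.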
Automatic Error Handling
Scale-Out Computing on AWS performs various dry-run checks before provisioning capacity. However, AWS may occasionally be unable to fulfill a request (e.g., 5 instances are needed but only 3 can be provisioned due to a capacity shortage within a placement group). In this case, Scale-Out Computing on AWS will keep trying to provision the capacity for 30 minutes. If the capacity is still not available after 30 minutes, Scale-Out Computing on AWS will automatically reset the request and try to provision capacity in a different Availability Zone. To simplify troubleshooting, all these errors are reported on the web interface.
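The retry-then-fallback behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the solution's actual implementation: `request_capacity` is a caller-supplied placeholder for the real provisioning call, the zone names are made up, and the clock and sleep functions are injectable so the example runs instantly.

```python
import time


def provision_with_fallback(request_capacity, zones, retry_window=30 * 60,
                            poll_interval=60, clock=time.monotonic, sleep=time.sleep):
    """Try each zone in turn, retrying within a zone until retry_window seconds pass.

    request_capacity(zone) must return True once the full capacity has been
    provisioned in that zone. Returns (zone or None, list of error messages).
    """
    errors = []
    for zone in zones:
        deadline = clock() + retry_window
        while True:
            if request_capacity(zone):
                return zone, errors
            if clock() >= deadline:
                break
            sleep(poll_interval)
        # Window expired: record the failure and move on to the next zone.
        errors.append(f"capacity not available in {zone} after {retry_window}s")
    return None, errors


# Simulated run with a fake clock: the first zone never has capacity,
# the second one succeeds on the first attempt.
fake_now = [0.0]
result = provision_with_fallback(
    request_capacity=lambda zone: zone == "zone-b",
    zones=["zone-a", "zone-b"],
    clock=lambda: fake_now[0],
    sleep=lambda seconds: fake_now.__setitem__(0, fake_now[0] + seconds),
)
print(result)
```

Collecting the error messages instead of raising on the first failure mirrors the idea of surfacing every failed attempt to the user for troubleshooting.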
With Fair Share, each user is given a score that varies based on:
- Number of jobs in the queue
- Time each job is queued
- Priority of each job
- Type of instance
Jobs that belong to the user with the highest score will start next. Fair Share is configured at the queue level (so you can have one queue using FIFO and another one using Fair Share).
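The exact scoring formula is not given here, so the following Python sketch only illustrates the general idea of combining the four factors into a per-user score and picking the highest one. The formula, the `priority` multiplier, and the `instance_weight` divisor are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    user: str
    minutes_queued: float
    priority: float = 1.0         # hypothetical multiplier, higher means more urgent
    instance_weight: float = 1.0  # hypothetical divisor, higher for costlier instance types


def user_scores(jobs):
    """Sum a per-job contribution for each user (illustrative formula only)."""
    scores = {}
    for job in jobs:
        # One point per queued job, plus waiting time scaled by priority,
        # discounted by the weight of the requested instance type.
        contribution = (1.0 + job.minutes_queued * job.priority) / job.instance_weight
        scores[job.user] = scores.get(job.user, 0.0) + contribution
    return scores


def next_user(jobs):
    """Return the user with the highest score, whose job would start next."""
    scores = user_scores(jobs)
    return max(scores, key=scores.get) if scores else None


queue = [
    QueuedJob("alice", minutes_queued=10),
    QueuedJob("alice", minutes_queued=5),
    QueuedJob("bob", minutes_queued=30, instance_weight=2.0),
]
print(user_scores(queue))
print(next_user(queue))
```

In this toy queue, the user with two shorter-waiting jobs outranks the user with one longer-waiting job on a heavier instance type, which shows how the factors can trade off against each other.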
And more ...
Refer to the various sections (tutorial, security, analytics, and so on) to learn more about this solution.