Create Optimized AMI to provision capacity faster
Create your SOCA Optimized AMI¶
By default, SOCA provisions vanilla AMIs from the AWS Marketplace (Amazon Linux, RHEL, CentOS, Rocky, Windows ...) and installs all required packages at boot. This process typically takes 5 to 10 minutes, and longer for some Virtual Desktop images (see the table below), depending on whether the node is an HPC node or a Virtual Desktop and on the operating system you are using.
If this cold-start time is not acceptable for your workload, you can launch AlwaysOn instances or pre-bake your AMI with all required libraries.
Below is the baseline time for an EC2 machine to go through the entire SOCA bootstrap sequence and be ready to serve a job or virtual desktop:
Node Type | Setup Time with Standard AMI | Setup Time with SOCA Optimized AMI | Provisioning Speedup |
---|---|---|---|
HPC Node (Amazon Linux 2) | 5 minutes | 1 minute | 80% |
HPC Node (Amazon Linux 2023) | 5 minutes | 1 minute | 80% |
HPC Node (RHEL9) | 8 minutes | 1 minute | 88% |
HPC Node (RHEL8) | 6 minutes | 1 minute | 83% |
HPC Node (Rocky9) | 7 minutes | 1 minute | 86% |
HPC Node (Rocky8) | 7 minutes | 1 minute | 86% |
Virtual Desktop (Amazon Linux 2) | 8 minutes | 1 minute | 88% |
Virtual Desktop (RHEL9) | 13 minutes | 1 minute | 92% |
Virtual Desktop (RHEL8) | 15 minutes | 1 minute | 93% |
Virtual Desktop (Rocky9) | 15 minutes | 1 minute | 93% |
Virtual Desktop (Rocky8) | 16 minutes | 1 minute | 94% |
Virtual Desktop (Windows) | 6 minutes | 4 minutes (see note 1) | 33% |
Info
Standard AMI: The standard vanilla AMI provided by AWS Marketplace.
SOCA Optimized AMI: A standard AMI with SOCA requirements (library, packages ...) pre-installed.
1: You won't see much performance improvement with Windows Virtual Desktop, as most time-sensitive actions such as compiling the scheduler/cache are not performed on Windows. However, as a Virtual Desktop machine, you will be able to restart a provisioned Desktop in ~1 minute.
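The Provisioning Speedup column is the share of setup time saved, i.e. (standard - optimized) / standard, rounded to the nearest percent. A minimal sketch of that arithmetic, assuming whole-minute inputs (the function name `provisioning_speedup` is illustrative and not part of SOCA):

```shell
#!/bin/bash
# Illustrative helper (not part of SOCA): percentage of setup time saved
# when moving from the Standard AMI to the SOCA Optimized AMI.
# Usage: provisioning_speedup <standard_minutes> <optimized_minutes>
provisioning_speedup() {
  local standard="$1" optimized="$2"
  # Integer arithmetic, rounded to the nearest percent
  echo $(( (100 * (standard - optimized) + standard / 2) / standard ))
}

provisioning_speedup 8 1   # RHEL9 HPC node row
provisioning_speedup 6 4   # Windows Virtual Desktop row
```

With the values from the table, the two calls print 88 and 33, matching the RHEL9 HPC and Windows Virtual Desktop rows.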
Linux HPC Node¶
Manual Image Creation¶
To create a Linux HPC node optimized AMI, you must first launch a simple job that keeps running for the duration of the AMI creation:
qsub -l base_os=amazonlinux2 -- /bin/tail -f /dev/null
Note
Change -l base_os to match the target operating system of your AMI (e.g. -l base_os=rhel9).
Wait until your job is running (R state) as shown below:
# Job is in `Q` state and not ready for the AMI Process. Wait a little longer
[socaadmin@ip-150-0-72-143 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.ip-150-0-72-143 STDIN socaadmin 0 Q normal
# Job is in `R` state, you can now proceed to the AMI generation
[socaadmin@ip-150-0-72-143 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
0.ip-150-0-72-143 STDIN socaadmin 00:00:00 R normal
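Rather than re-running qstat by hand, the wait can be scripted. The sketch below is one possible approach, assuming `qstat -f <jobid>` prints a `job_state = <letter>` line; the function name, the default of 60 checks and the 30-second interval are illustrative choices, not SOCA defaults:

```shell
#!/bin/bash
# Illustrative helper (not part of SOCA): block until a PBS job reaches the
# "R" (running) state, polling `qstat -f` at a fixed interval.
# Usage: wait_for_running_job <job_id> [max_attempts] [seconds_between_checks]
wait_for_running_job() {
  local job_id="$1" max_attempts="${2:-60}" interval="${3:-30}"
  local attempt state
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # `qstat -f` prints one "key = value" pair per line
    state=$(qstat -f "${job_id}" | awk -F' = ' '/job_state/ {print $2}')
    if [[ "${state}" == "R" ]]; then
      echo "Job ${job_id} is running"
      return 0
    fi
    echo "Job ${job_id} is in state '${state}', checking again in ${interval}s"
    sleep "${interval}"
  done
  echo "Job ${job_id} is still not running after ${max_attempts} checks" >&2
  return 1
}
```

For the job shown above, `wait_for_running_job 0` would return once the `S` column switches from `Q` to `R`.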
Once your job is running, you can create the AMI. Go to the EC2 console and select the EC2 instance dedicated to the job (the instance name is <soca-$SOCA_CLUSTER_ID>-<JOB_ID>), then click Actions (1) > Image and templates (2) > Create Image (3).
Note
You can also retrieve the machine serving the current job by running qstat -f <jobid> | grep comment
# The IP of the EC2 instance we want to create an AMI from is 150.0.241.185
qstat -f 0 | grep comment
comment = Job run at Fri Dec 20 at 08:12 on (ip-150-0-241-185:ncpus=1)
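The host name in the job comment encodes the private IP of the node, which can then be mapped to an EC2 instance ID. The following is a sketch under a few assumptions: the comment fits on a single line as in the example above, the AWS CLI is configured with credentials allowed to call `ec2:DescribeInstances`, and both function names are hypothetical:

```shell
#!/bin/bash
# Illustrative helpers (not part of SOCA).

# Read a `qstat -f` comment line on stdin and print the private IP encoded in
# the execution host name, e.g. "(ip-150-0-241-185:ncpus=1)" -> 150.0.241.185
comment_to_private_ip() {
  sed -n 's/.*(ip-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)[.:)].*/\1.\2.\3.\4/p'
}

# Print the EC2 instance ID(s) that own the private IP serving a given job.
# Usage: instance_id_for_job <job_id>
instance_id_for_job() {
  local job_id="$1" private_ip
  private_ip=$(qstat -f "${job_id}" | grep comment | comment_to_private_ip)
  aws ec2 describe-instances \
    --filters "Name=private-ip-address,Values=${private_ip}" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text
}
```

Calling `instance_id_for_job 0` would filter on the private IP 150.0.241.185 for the job shown above.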
This will open the "Create Image" wizard. Enter an AMI name and description, then make sure the 'Reboot Instance' checkbox is unchecked. Click Create Image to start the AMI creation.
You should see a confirmation message with your AMI ID:
At this point, your AMI is not yet created: you must wait until the image status is Available. The time required depends on the size of the AMI and can range from 2 minutes to multiple hours.
Wait until your AMI is in the available state.
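If you prefer the command line to refreshing the console, the AWS CLI ships an `image-available` waiter. The sketch below wraps it in a retry loop, since a single waiter run can give up before a large AMI is ready; the function name and the default of 6 runs are assumptions made for illustration:

```shell
#!/bin/bash
# Illustrative helper (not part of SOCA): wait for an AMI to become available.
# `aws ec2 wait image-available` polls DescribeImages and returns non-zero if
# the image is still not available when the waiter gives up, so it is retried
# here to cover large AMIs.
# Usage: wait_for_ami <ami_id> [max_waiter_runs]
wait_for_ami() {
  local ami_id="$1" max_runs="${2:-6}" run
  for ((run = 1; run <= max_runs; run++)); do
    if aws ec2 wait image-available --image-ids "${ami_id}"; then
      echo "AMI ${ami_id} is available"
      return 0
    fi
    echo "AMI ${ami_id} not available yet (waiter run ${run}/${max_runs})"
  done
  return 1
}
```

The credentials used must be allowed to call `ec2:DescribeImages`. A non-zero return only means the AMI was not confirmed within the allotted runs, not necessarily that its creation failed.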
Time to delete the previous job
Now that your AMI is ready, don't forget to delete the test job you launched previously (qdel <jobid>), as the machine is no longer needed.
That's it! You can now test your custom AMI by using the -l instance_ami parameter.
# Make sure to adjust base_os based on the operating system of the AMI
qsub -l instance_ami=ami-04b5fb10f28ea81ba -l base_os=amazonlinux2 -- /bin/sleep 600
You can compare the provisioning time by looking at the logs under /apps/soca/<SOCA_CLUSTER_ID>/cluster_node_bootstrap/logs/compute_node/<JOB_ID>/<HOST>:
# Job ID 0 is using the Standard AMI. This is the base job we launched and created an optimized AMI from
# The entire bootstrap sequence was ~5 minutes (start 08:06, end 08:11)
ls -ltr /apps/soca/soca-clustertest/cluster_node_bootstrap/logs/compute_node/0/ip-150-0-241-185/
-rw------- 1 root root 146325 Dec 20 08:06 messages
-rw-r----- 1 root root 46185 Dec 20 08:06 cloud-init-output.log
-rw-r--r-- 1 root root 92938 Dec 20 08:06 cloud-init.log
-rw-r--r-- 1 root root 716720 Dec 20 08:11 02_setup.log
-rw-r--r-- 1 root root 1020 Dec 20 08:11 03_setup_post_reboot.log
# Job ID 1 is using the SOCA Optimized AMI created from Job ID 0.
# The entire bootstrap sequence was ~1 minute (start 08:34, end 08:35)
ls -ltr /apps/soca/soca-clustertest/cluster_node_bootstrap/logs/compute_node/1/ip-150-0-164-125/
-rw------- 1 root root 431212 Dec 20 08:34 messages
-rw-r--r-- 1 root root 273657 Dec 20 08:34 cloud-init.log
-rw-r----- 1 root root 85572 Dec 20 08:34 cloud-init-output.log
-rw-r--r-- 1 root root 16032 Dec 20 08:35 02_setup.log
-rw-r--r-- 1 root root 1127 Dec 20 08:35 03_setup_post_reboot.log
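The same comparison can be approximated without reading the listing by eye. This sketch estimates the duration as the gap between the oldest and newest log modification times in a node's log directory; it assumes GNU `stat` (Linux), and the function name is hypothetical:

```shell
#!/bin/bash
# Illustrative helper (not part of SOCA): estimate the bootstrap duration of a
# node as the gap between the oldest and newest log modification times.
# Relies on GNU `stat -c %Y` (seconds since epoch), as found on Linux.
# Usage: bootstrap_duration_seconds <log_directory>
bootstrap_duration_seconds() {
  local log_dir="$1" first last
  first=$(stat -c %Y "${log_dir}"/* | sort -n | head -n 1)
  last=$(stat -c %Y "${log_dir}"/* | sort -n | tail -n 1)
  echo $(( last - first ))
}
```

This is only a rough estimate, because a modification time records the last write to each log rather than the moment the bootstrap started.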
Automated Image Creation¶
You can also use the AWS API to create an image automatically using awscli or boto3.
There are multiple ways to achieve this. Here is one example where you submit a job that creates the AMI automatically from the EC2 machine provisioned for this job.
Before doing so, you must grant some additional permissions to the SOCA compute nodes.
Important
This operation grants all compute nodes the ability to create EC2 machine images (AMIs).
Permissions are limited to EC2 machines running in your current SOCA environment and having the compute_node NodeType.
This policy allows compute nodes to:
- Create an AMI for any SOCA Compute Node (node serving HPC job) running on your current SOCA environment
This policy does not allow compute nodes to:
- Create an image of an EC2 instance not managed by your current SOCA environment
- Create an image of the Controller or Login Node
- Interact with any existing EC2 image hosted on your AWS account
Make sure these IAM permissions adhere to your company's security/compliance requirements before implementing them.
IAM Permission
First, you must give the ComputeNodeRole IAM role the required permissions to execute the AMI creation. Go to the IAM console and locate the ComputeNodeRole IAM role associated with your cluster ID. The naming convention is SOCA_CLUSTER_ID-ComputeNodeRole<UniqueID>-<UniqueID> (example: soca-demoenv-ComputeNodeRole7A9ECFBB-HnkXDljR91iE).
Add the following IAM policy to the ComputeNodeRole, and make sure to replace <REPLACE_WITH_YOUR_SOCA_CLUSTER_ID> with your SOCA_CLUSTER_ID value (e.g. soca-myenv).
This policy will allow compute nodes to create an AMI only:
- If the target EC2 instance is part of the same SOCA environment
- If the target EC2 instance is a compute_node
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Perm1",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags",
        "ec2:CreateImage"
      ],
      "Resource": [
        "arn:aws:ec2:*::image/*",
        "arn:aws:ec2:*::snapshot/*"
      ]
    },
    {
      "Sid": "Perm2",
      "Effect": "Allow",
      "Action": "ec2:CreateSnapshot",
      "Resource": "*"
    },
    {
      "Sid": "Perm3",
      "Effect": "Allow",
      "Action": "ec2:CreateImage",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/soca:ClusterId": "<REPLACE_WITH_YOUR_SOCA_CLUSTER_ID>",
          "ec2:ResourceTag/soca:NodeType": "compute_node"
        }
      }
    }
  ]
}
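If you would rather apply the policy from the command line than from the IAM console, one possible approach is sketched below. It assumes the JSON above is saved to a local template file and that your credentials are allowed to call `iam:PutRolePolicy`; the function names and the inline policy name `SOCAComputeNodeCreateImage` are hypothetical:

```shell
#!/bin/bash
# Illustrative helpers (not part of SOCA).

# Read the policy template on stdin and print it with the placeholder replaced.
# Usage: render_ami_policy <soca_cluster_id> < template.json > policy.json
render_ami_policy() {
  local cluster_id="$1"
  sed "s/<REPLACE_WITH_YOUR_SOCA_CLUSTER_ID>/${cluster_id}/g"
}

# Attach the rendered policy to the ComputeNodeRole as an inline policy.
# The policy name below is a hypothetical choice.
# Usage: attach_ami_policy <role_name> <policy_file>
attach_ami_policy() {
  local role_name="$1" policy_file="$2"
  aws iam put-role-policy \
    --role-name "${role_name}" \
    --policy-name "SOCAComputeNodeCreateImage" \
    --policy-document "file://${policy_file}"
}
```

For example, `render_ami_policy soca-myenv < template.json > policy.json` followed by `attach_ami_policy soca-myenv-ComputeNodeRole<UniqueID>-<UniqueID> policy.json`, using the role name you located in the IAM console.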
Once you have applied the IAM permission to your compute nodes, create the script below (e.g. automated_ami_creation.sh) and submit it as a job as you would normally do. This script will create an AMI from the current machine automatically.
#!/bin/bash
#PBS -N SOCAAmiCreationRequest
# Adjust limit as needed based on time of the AMI to become available
# (default: 60 minutes)
MAX_AMI_VERIFICATION_LOOPS=60
SECONDS_BETWEEN_AMI_VERIFICATION=60
# Retrieve IMDS Token
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
echo "Received IMDS Token: ${TOKEN}"
# Retrieve current EC2 Instance ID from the IMDS service
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: ${TOKEN}" http://169.254.169.254/latest/meta-data/instance-id)
echo "Detected Instance ID: ${INSTANCE_ID}"
# Submit the AMI creation request
AMI_ID=$(aws ec2 create-image \
  --instance-id "${INSTANCE_ID}" \
  --name "SOCA_AMI_JOB_${PBS_JOBID}" \
  --description "SOCA AMI Creation generated from JOB ID ${PBS_JOBID}" \
  --no-reboot \
  --output text
)
# Save the exit code of the aws command before any other command overwrites $?
CREATE_IMAGE_RC=$?
echo "AMI Creation Request Output: ${AMI_ID}"
if [[ ${CREATE_IMAGE_RC} -ne 0 ]]; then
  echo "Unable to generate AMI. Make sure you have setup the correct IAM permissions. See Log Error"
  exit 1
fi
# Wait until the AMI is available
for ((ATTEMPT=0; ATTEMPT<=MAX_AMI_VERIFICATION_LOOPS; ATTEMPT++)); do
  echo "Attempt ${ATTEMPT} of ${MAX_AMI_VERIFICATION_LOOPS}: Checking AMI status..."
  # Get the current state of the AMI
  STATE=$(aws ec2 describe-images --image-ids "${AMI_ID}" --query "Images[0].State" --output text)
  # Check if the state is 'available'
  if [[ "${STATE}" == "available" ]]; then
    echo "AMI $AMI_ID is now available!"
    exit 0
  fi
  # Print the current state
  echo "Current state: ${STATE}"
  # Wait before the next attempt
  if [[ ${ATTEMPT} -lt ${MAX_AMI_VERIFICATION_LOOPS} ]]; then
    echo "Waiting for ${SECONDS_BETWEEN_AMI_VERIFICATION} seconds before checking again..."
    sleep "${SECONDS_BETWEEN_AMI_VERIFICATION}"
  else
    # Exit if AMI has not been verified after limit
    echo "Unable to verify AMI after ${MAX_AMI_VERIFICATION_LOOPS} attempts. Check the EC2 Image console for more info."
    echo "This does not mean the AMI request failed, but the AMI is still being created after the timeout. Large AMIs can take several hours to complete"
    exit 1
  fi
done
exit 0
Submit the job as you would normally do. Here is an example if you want to create a RHEL8-based AMI:
qsub -l base_os=rhel8 automated_ami_creation.sh
Wait until the job finishes, then check the logs to get your AMI ID.
# Job is in "Q" state, meaning the EC2 capacity is still being provisioned
[socaadmin@ip-159-0-66-95 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
10.ip-159-0-66-95 SOCAAmiCreation* socaadmin 0 Q normal
# Job is in "R" state, meaning the EC2 Image creation is in progress
# This can take multiple hours based on the EC2 instance size
[socaadmin@ip-159-0-66-95 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
10.ip-159-0-66-95 SOCAAmiCreation* socaadmin 00:00:08 R normal
Once your job is completed, check the stdout log of your job to retrieve the AMI ID. If you used the example script above:
- stdout log: SOCAAmiCreationRequest.o<JOB_ID>
- stderr log: SOCAAmiCreationRequest.e<JOB_ID>
cat SOCAAmiCreationRequest.o10
Received IMDS Token: AQAEAALuWaWN6ggGUegeTFljo4VHxj0kQ65yL6KaHMitfrjdTPPoxw==
Detected Instance ID: i-02a3a8658a715b118
AMI Creation Request Output: ami-04f15e3b2725ea402
Attempt 0 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 1 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 2 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 3 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 4 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 5 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 6 of 60: Checking AMI status...
AMI ami-04f15e3b2725ea402 is now available!
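To reuse the AMI ID in later commands, it can be pulled out of the stdout log instead of copied by hand. A minimal sketch, assuming the log contains the `AMI Creation Request Output:` line printed by the example script (the function name is hypothetical):

```shell
#!/bin/bash
# Illustrative helper (not part of SOCA): print the AMI ID recorded in the
# stdout log of the AMI creation job.
# Usage: ami_id_from_job_log <stdout_log_file>
ami_id_from_job_log() {
  local log_file="$1"
  grep "AMI Creation Request Output:" "${log_file}" \
    | grep -oE 'ami-[0-9a-f]+' \
    | head -n 1
}
```

With the log shown above, `ami_id_from_job_log SOCAAmiCreationRequest.o10` would print ami-04f15e3b2725ea402.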
Optionally, you can navigate to the EC2 AMI console and confirm the AMI you just created (ami-04f15e3b2725ea402) is there:
Note
The provisioned EC2 capacity will automatically be terminated once your job completes.
Update default AMI (Optional)¶
Single job¶
If you plan to use a custom AMI, you must specify -l instance_ami=<IMAGE_ID> at job submission.
Use the "Entire queue" option below if you do not want to specify this resource manually each time you submit a job.
Entire queue¶
Edit /apps/soca/${SOCA_CLUSTER_ID}/cluster_manager/orchestrator/settings/queue_mapping.yml and update the default AMI:
queue_type:
  compute:
    queues: ["queue1", "queue2", "queue3"]
    instance_ami: "<YOUR_AMI_ID>" # <- Add your new AMI
    instance_type: ...
Any job running in a queue configured in queue_mapping.yml will now use your pre-configured AMI by default. You no longer need to specify -l instance_ami at job submission.
Prevent users from specifying a custom AMI¶
By default, SOCA users can use any available AMI via the -l instance_ami job parameter.
You can add instance_ami as a restricted parameter to ensure users cannot use an AMI that has not been validated by SOCA admins.
Linux Virtual Desktop Node¶
Refer to this page to learn how to create a SOCA Optimized AMI for your Linux Virtual Desktop.
Windows Virtual Desktop Node¶
Refer to this page to learn how to create a SOCA Optimized AMI for your Windows Virtual Desktop.