Create Optimized AMI to provision capacity faster

Create your SOCA Optimized AMI

By default, SOCA provisions vanilla AMIs from the AWS Marketplace (Amazon Linux, RHEL, CentOS, Rocky, Windows ...) and installs all required packages. This process can take between 5 and 10 minutes, depending on whether the node is an HPC or Virtual Desktop node and on which operating system you are using.

If this cold-start time is not acceptable for your workload, you can launch AlwaysOn instances or pre-bake your AMI with all required libraries.

Below is the baseline time for an EC2 machine to go through the entire SOCA bootstrap sequence and be ready to serve jobs or virtual desktop sessions:

Node Type	Setup Time with Standard AMI	Setup Time with SOCA Optimized AMI	Provisioning Speedup
HPC Node (Amazon Linux 2)	5 minutes	1 minute	80%
HPC Node (Amazon Linux 2023)	5 minutes	1 minute	80%
HPC Node (RHEL9)	8 minutes	1 minute	88%
HPC Node (RHEL8)	6 minutes	1 minute	83%
HPC Node (Rocky9)	7 minutes	1 minute	86%
HPC Node (Rocky8)	7 minutes	1 minute	86%
Virtual Desktop (Amazon Linux 2)	8 minutes	1 minute	88%
Virtual Desktop (RHEL9)	13 minutes	1 minute	92%
Virtual Desktop (RHEL8)	15 minutes	1 minute	93%
Virtual Desktop (Rocky9)	15 minutes	1 minute	93%
Virtual Desktop (Rocky8)	16 minutes	1 minute	94%
Virtual Desktop (Windows)	6 minutes	4 minutes ¹	33%

Info

Standard AMI: The standard vanilla AMI provided by AWS Marketplace.

SOCA Optimized AMI: A standard AMI with SOCA requirements (libraries, packages ...) pre-installed.

¹: You won't see much performance improvement with a Windows Virtual Desktop, as most time-sensitive actions such as compiling the scheduler/cache are not performed on Windows. However, as with any Virtual Desktop machine, you will be able to restart a provisioned desktop in ~1 minute.
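The "Provisioning Speedup" column can be read as the percentage reduction in setup time. A minimal sketch of that arithmetic, assuming the values are rounded to the nearest integer (the function name `setup_time_reduction` is only illustrative):

```shell
# Percentage reduction in setup time, rounded to the nearest integer.
# Usage: setup_time_reduction <standard_minutes> <optimized_minutes>
setup_time_reduction() {
    local standard=$1 optimized=$2
    # Integer arithmetic: adding standard/2 before dividing rounds to nearest
    echo $(( ((standard - optimized) * 100 + standard / 2) / standard ))
}

setup_time_reduction 8 1    # HPC Node (RHEL9) row, prints 88
setup_time_reduction 6 4    # Virtual Desktop (Windows) row, prints 33
```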

Linux HPC Node

Manual Image Creation

To create a Linux HPC node optimized AMI, you must first launch a simple job that keeps running for the duration of the AMI creation:

qsub -l base_os=amazonlinux2 -- /bin/tail -f /dev/null

Note

Change -l base_os to match the target operating system of your AMI (e.g: -l base_os=rhel9)

Click here for all supported Base OS

Wait until your job is running (R state) as shown below:

# Job is in `Q` state and not ready for the AMI Process. Wait a little longer
[socaadmin@ip-150-0-72-143 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
0.ip-150-0-72-143 STDIN            socaadmin                0 Q normal

# Job is in `R` state, you can now proceed to the AMI generation
[socaadmin@ip-150-0-72-143 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
0.ip-150-0-72-143 STDIN            socaadmin         00:00:00 R normal
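If you prefer not to re-run qstat by hand, the wait can be scripted. The sketch below assumes that `qstat -f <jobid>` prints a `job_state = <state>` line, as PBS-style schedulers usually do. The helper names `job_state_from_qstat` and `wait_for_running` are hypothetical and not part of SOCA.

```shell
# Print the job_state value (Q, R, ...) from `qstat -f` output read on stdin.
job_state_from_qstat() {
    awk -F' = ' '/^[[:space:]]*job_state = / { print $2; exit }'
}

# Block until the given job reaches the R state.
# Usage: wait_for_running <jobid> [poll_seconds]
wait_for_running() {
    local job_id=$1 poll=${2:-30} state
    while true; do
        state=$(qstat -f "${job_id}" | job_state_from_qstat)
        if [[ -z "${state}" ]]; then
            echo "Job ${job_id} not found (it may have completed or been deleted)" >&2
            return 1
        fi
        [[ "${state}" == "R" ]] && return 0
        echo "Job ${job_id} is in '${state}' state, waiting ${poll}s..."
        sleep "${poll}"
    done
}

# Example: wait_for_running 0 && echo "Ready for AMI creation"
```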

Once your job is running, you can create the AMI. Go to the EC2 console and select the EC2 machine dedicated to the job (the name of the instance will be <soca-$SOCA_CLUSTER_ID>-<JOB_ID>), then click Actions > Image and templates > Create Image.

Note

You can also retrieve the machine serving the current job by running qstat -f <jobid> | grep comment

# The IP of the EC2 instance we want to create an AMI from is 150.0.241.185
qstat -f 0  | grep comment
comment = Job run at Fri Dec 20 at 08:12 on (ip-150-0-241-185:ncpus=1)
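The hostname in the comment can be turned into a private IP and then into an EC2 instance ID. This is a sketch, assuming the node uses the default IP-based hostname (`ip-a-b-c-d`) and that you run it where awscli is installed with `ec2:DescribeInstances` permission. Both helper names are hypothetical.

```shell
# Extract the private IP from a qstat comment line read on stdin, e.g.
#   comment = Job run at Fri Dec 20 at 08:12 on (ip-150-0-241-185:ncpus=1)
comment_to_private_ip() {
    grep -oE 'ip-[0-9]+-[0-9]+-[0-9]+-[0-9]+' | head -n 1 | sed -e 's/^ip-//' -e 's/-/./g'
}

# Look up the EC2 instance ID that owns a given private IP.
private_ip_to_instance_id() {
    aws ec2 describe-instances \
        --filters "Name=private-ip-address,Values=$1" \
        --query "Reservations[].Instances[].InstanceId" \
        --output text
}

# Example:
#   IP=$(qstat -f 0 | grep comment | comment_to_private_ip)
#   private_ip_to_instance_id "${IP}"
```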

This will open the "Create Image" wizard. Enter an AMI name and description, then make sure the 'Reboot Instance' checkbox is unchecked. Click Create Image to proceed with the AMI creation.

You should see a confirmation message with your AMI ID.

At this point, your AMI is not yet created, and you must wait until the image status is Available. The time varies with the size of the AMI and can be anywhere between 2 minutes and multiple hours.
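As an alternative to refreshing the console, the AMI state can be polled with the AWS CLI waiter `aws ec2 wait image-available`. A single waiter call gives up after a limited number of checks, so this sketch retries it several times. The function name `wait_for_ami` and the default of 12 rounds are assumptions to adjust for large images.

```shell
# Wait for an AMI to reach the "available" state, retrying the CLI waiter.
# Usage: wait_for_ami <ami_id> [max_rounds]
wait_for_ami() {
    local ami_id=$1 max_rounds=${2:-12} round
    for ((round = 1; round <= max_rounds; round++)); do
        # The waiter polls on its own and returns non-zero if it gives up
        if aws ec2 wait image-available --image-ids "${ami_id}"; then
            echo "AMI ${ami_id} is available"
            return 0
        fi
        echo "AMI ${ami_id} not available yet (round ${round}/${max_rounds})"
    done
    return 1
}

# Example: wait_for_ami ami-04b5fb10f28ea81ba
```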

Time to delete the previous job

Now that your AMI is ready, don't forget to delete the test job you have launched previously (qdel <jobid>) as the machine is no longer needed.

That's it! You can now test your custom AMI by using the -l instance_ami parameter.

# Make sure to adjust base_os based on the operating system of the AMI
qsub -l instance_ami=ami-04b5fb10f28ea81ba -l base_os=amazonlinux2 -- /bin/sleep 600

You can compare the provisioning time by looking at the logs under /opt/soca/<SOCA_CLUSTER_ID>/cluster_node_bootstrap/logs/compute_node/<JOB_ID>/<HOST>

# Job ID 0 is using the Standard AMI. This is the base job we launched and created the optimized AMI from
# The entire bootstrap sequence was ~5 minutes (start 08:06, end 08:11)
ls -ltr /opt/soca/soca-clustertest/cluster_node_bootstrap/logs/compute_node/0/ip-150-0-241-185/
-rw------- 1 root root 146325 Dec 20 08:06 messages
-rw-r----- 1 root root  46185 Dec 20 08:06 cloud-init-output.log
-rw-r--r-- 1 root root  92938 Dec 20 08:06 cloud-init.log
-rw-r--r-- 1 root root 716720 Dec 20 08:11 02_setup.log
-rw-r--r-- 1 root root   1020 Dec 20 08:11 03_setup_post_reboot.log

# Job ID 1 is using the SOCA Optimized AMI created from Job ID 0.
# The entire bootstrap sequence was ~1 minute (start 08:34, end 08:35)
ls -ltr /opt/soca/soca-clustertest/cluster_node_bootstrap/logs/compute_node/1/ip-150-0-164-125/
-rw------- 1 root root 431212 Dec 20 08:34 messages
-rw-r--r-- 1 root root 273657 Dec 20 08:34 cloud-init.log
-rw-r----- 1 root root  85572 Dec 20 08:34 cloud-init-output.log
-rw-r--r-- 1 root root  16032 Dec 20 08:35 02_setup.log
-rw-r--r-- 1 root root   1127 Dec 20 08:35 03_setup_post_reboot.log
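The same comparison can be estimated from the file timestamps instead of reading them by eye. This sketch assumes GNU `stat` and enough privileges to list the log directory. It measures only the gap between the oldest and newest log file, which is a rough proxy for the bootstrap duration, and the function name is illustrative.

```shell
# Print the number of seconds between the oldest and newest file
# modification time in a bootstrap log directory.
# Usage: bootstrap_duration_seconds <log_dir>
bootstrap_duration_seconds() {
    local log_dir=$1
    stat -c '%Y' "${log_dir}"/* \
        | sort -n \
        | awk 'NR == 1 { first = $1 } { last = $1 } END { print last - first }'
}

# Example:
#   bootstrap_duration_seconds /opt/soca/soca-clustertest/cluster_node_bootstrap/logs/compute_node/0/ip-150-0-241-185
```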

Automated Image Creation

You can also use the AWS API to create an image automatically with awscli or boto3.

There are multiple ways to achieve this. The example here submits a job that creates the AMI automatically from the EC2 machine provisioned for that job.

Before doing so, you must give some additional permissions to the SOCA compute nodes.

Important

This operation allows all compute nodes to create EC2 machine images (AMIs).

Permissions are limited to EC2 machines running on your current SOCA environment and having compute_node NodeType.

This policy allows compute nodes to:

Create an AMI for any SOCA Compute Node (node serving HPC job) running on your current SOCA environment

This policy does not allow compute nodes to:

Create an image of an EC2 instance not managed by your current SOCA environment
Create an image of the Controller or Login Node
Interact with any existing EC2 image hosted on your AWS account

Make sure these IAM permissions adhere to your company's security/compliance requirements before implementing them.

IAM Permission

First, you must give the ComputeNodeRole IAM role the permissions required to create the AMI. Go to the IAM console and locate the ComputeNodeRole IAM role associated with your cluster ID. The naming convention is SOCA_CLUSTER_ID-ComputeNodeRole<UniqueID>-<UniqueID> (example: soca-demoenv-ComputeNodeRole7A9ECFBB-HnkXDljR91iE)

Add the following IAM policy to the ComputeNodeRole. Make sure to replace <REPLACE_WITH_YOUR_SOCA_CLUSTER_ID> with your SOCA_CLUSTER_ID value (e.g. soca-myenv)

This policy will allow compute nodes to create an AMI only:

If the target EC2 instance is part of the same SOCA environment
If the target EC2 instance is a compute_node

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Perm1",
            "Effect": "Allow",
            "Action": [
                "ec2:CreateTags",
                "ec2:CreateImage"
            ],
            "Resource": [
                "arn:aws:ec2:*::image/*",
                "arn:aws:ec2:*::snapshot/*"
            ]
        },
        {
            "Sid": "Perm2",
            "Effect": "Allow",
            "Action": "ec2:CreateSnapshot",
            "Resource": "*"
        },
        {
            "Sid": "Perm3",
            "Effect": "Allow",
            "Action": "ec2:CreateImage",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringEquals": {
                    "ec2:ResourceTag/soca:ClusterId": "<REPLACE_WITH_YOUR_SOCA_CLUSTER_ID>",
                    "ec2:ResourceTag/soca:NodeType": "compute_node"
                }
            }
        }
    ]
}
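If you would rather apply the policy from the command line than from the IAM console, something like the sketch below could work. It assumes the JSON above is saved as a template file and that your credentials allow `iam:PutRolePolicy`. The inline policy name `SOCAComputeNodeCreateImage` and both function names are made up for this example.

```shell
# Replace the cluster ID placeholder in a policy template read on stdin.
render_policy() {
    local cluster_id=$1
    sed "s/<REPLACE_WITH_YOUR_SOCA_CLUSTER_ID>/${cluster_id}/g"
}

# Attach the rendered policy inline to the ComputeNodeRole.
# Usage: attach_ami_policy <role_name> <cluster_id> <policy_template.json>
attach_ami_policy() {
    local role_name=$1 cluster_id=$2 template=$3 rendered rc
    rendered=$(mktemp)
    render_policy "${cluster_id}" < "${template}" > "${rendered}"
    aws iam put-role-policy \
        --role-name "${role_name}" \
        --policy-name "SOCAComputeNodeCreateImage" \
        --policy-document "file://${rendered}"
    rc=$?
    rm -f "${rendered}"
    return "${rc}"
}

# Example:
#   attach_ami_policy soca-demoenv-ComputeNodeRole7A9ECFBB-HnkXDljR91iE soca-demoenv policy_template.json
```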

Once you have applied the IAM permission to your compute nodes, create the script below (e.g. call it automated_ami_creation.sh) and submit it as you would any other job. This script automatically creates an AMI from the machine it runs on.

#!/bin/bash

#PBS -N SOCAAmiCreationRequest

# Adjust these limits based on how long the AMI takes to become available
# (default: 60 checks, 60 seconds apart, i.e. about 60 minutes)
MAX_AMI_VERIFICATION_LOOPS=60
SECONDS_BETWEEN_AMI_VERIFICATION=60
# Retrieve IMDS Token
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

echo "Received IMDS Token: ${TOKEN}"

# Retrieve current EC2 Instance ID from the IMDS service
INSTANCE_ID=$(curl -s  -H "X-aws-ec2-metadata-token: ${TOKEN}" http://169.254.169.254/latest/meta-data/instance-id)

echo "Detected Instance ID: ${INSTANCE_ID}"
# Submit the AMI creation request and keep only the image ID
AMI_ID=$(aws ec2 create-image \
    --instance-id "${INSTANCE_ID}" \
    --name "SOCA_AMI_JOB_${PBS_JOBID}" \
    --description "SOCA AMI Creation generated from JOB ID ${PBS_JOBID}" \
    --no-reboot \
    --query "ImageId" \
    --output text
)

# Check the exit code of create-image right away, before another command overwrites $?
if [[ $? -ne 0 ]]; then
    echo "Unable to generate AMI. Make sure you have setup the correct IAM permissions. See Log Error"
    exit 1
fi
echo "AMI Creation Request Output: ${AMI_ID}"

# Wait until the AMI is available
for ((ATTEMPT=0; ATTEMPT<=MAX_AMI_VERIFICATION_LOOPS; ATTEMPT++)); do
    echo "Attempt ${ATTEMPT} of ${MAX_AMI_VERIFICATION_LOOPS}: Checking AMI status..."

    # Get the current state of the AMI
    STATE=$(aws ec2 describe-images --image-ids "${AMI_ID}" --query "Images[0].State" --output text)

    # Check if the state is 'available'
    if [[ "${STATE}" == "available" ]]; then
        echo "AMI $AMI_ID is now available!"
        exit 0
    fi

    # Print the current state
    echo "Current state: ${STATE}"

    # Wait before the next attempt
    if [[ ${ATTEMPT} -lt ${MAX_AMI_VERIFICATION_LOOPS} ]]; then
        echo "Waiting for ${SECONDS_BETWEEN_AMI_VERIFICATION} seconds before checking again..."
        sleep ${SECONDS_BETWEEN_AMI_VERIFICATION}
    else
        # Exit if AMI has not been verified after limit
        echo "Unable to verify AMI after ${MAX_AMI_VERIFICATION_LOOPS} attempts. Check the EC2 Image console for more info."
        echo "This does not mean the AMI request failed, only that the AMI was still being created when the timeout was reached. Large AMIs can take several hours to complete"
        exit 1
    fi
done

exit 0

Submit the script as you would any other job. Here is an example that creates a RHEL8-based AMI:

qsub -l base_os=rhel8 automated_ami_creation.sh

Wait until the job finishes, then check the logs to get your AMI ID.

# Job is in "Q" state, meaning the EC2 capacity is still being provisioned

[socaadmin@ip-159-0-66-95 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
10.ip-159-0-66-95  SOCAAmiCreation* socaadmin                0 Q normal

# Job is in "R" state, meaning the EC2 Image creation is in progress
# This can take multiple hours depending on the EC2 instance size

[socaadmin@ip-159-0-66-95 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
10.ip-159-0-66-95  SOCAAmiCreation* socaadmin         00:00:08 R normal

Once your job is completed, check the stdout log of your job to retrieve the AMI ID. If you used the example script above, the log files are:

stdout log: SOCAAmiCreationRequest.o<JOB_ID>
stderr log: SOCAAmiCreationRequest.e<JOB_ID>

cat SOCAAmiCreationRequest.o10

Received IMDS Token: AQAEAALuWaWN6ggGUegeTFljo4VHxj0kQ65yL6KaHMitfrjdTPPoxw==
Detected Instance ID: i-02a3a8658a715b118
AMI Creation Request Output: ami-04f15e3b2725ea402
Attempt 0 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 1 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 2 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 3 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 4 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 5 of 60: Checking AMI status...
Current state: pending
Waiting for 60 seconds before checking again...
Attempt 6 of 60: Checking AMI status...
AMI ami-04f15e3b2725ea402 is now available!
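To pick the AMI ID out of the log without reading it line by line, a small helper can be used. This sketch relies only on the log format shown above. The function names `ami_id_from_log` and `ami_ready_in_log` are illustrative.

```shell
# Print the first AMI ID found in a job stdout log.
# Usage: ami_id_from_log <stdout_log>
ami_id_from_log() {
    grep -oE 'ami-[0-9a-f]+' "$1" | head -n 1
}

# Return 0 if the log reports that the AMI reached the available state.
ami_ready_in_log() {
    grep -q 'is now available!' "$1"
}

# Example:
#   ami_ready_in_log SOCAAmiCreationRequest.o10 && ami_id_from_log SOCAAmiCreationRequest.o10
```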

Optionally, you can navigate to the EC2 AMI console and confirm the AMI you just created (ami-04f15e3b2725ea402) is there.

Note

The provisioned EC2 capacity will automatically be terminated once your job completes.

Update default AMI (Optional)

Single job

As you are planning to use a custom AMI, you will be required to specify -l instance_ami=<IMAGE_ID> at job submission. It's recommended to go with the "Entire Queue" option below if you do not want to manually specify this resource each time you submit a job.

Entire queue

Edit /opt/soca/${SOCA_CLUSTER_ID}/cluster_manager/orchestrator/settings/queue_mapping.yml and update the default AMI

queue_type:
  compute:
    queues: ["queue1", "queue2", "queue3"] 
    instance_ami: "<YOUR_AMI_ID>" # <- Add your new AMI 
    instance_type: ...

Any job submitted to the queues listed in this queue_mapping entry will now use your pre-configured AMI by default. You no longer need to specify -l instance_ami at job submission.
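Before putting an AMI ID in queue_mapping.yml, it can be worth checking that the ID is well formed and that the image is available in the cluster's region. This is a sketch of one way to do that, assuming awscli with `ec2:DescribeImages` permission. Neither function is part of SOCA.

```shell
# Return 0 if the argument looks like an AMI ID
# (ami- followed by 8 or 17 lowercase hex digits).
is_valid_ami_id() {
    local re='^ami-([0-9a-f]{8}|[0-9a-f]{17})$'
    [[ "$1" =~ $re ]]
}

# Return 0 only if the AMI exists in the current region and is available.
ami_is_available() {
    local state
    is_valid_ami_id "$1" || return 1
    state=$(aws ec2 describe-images --image-ids "$1" \
        --query "Images[0].State" --output text) || return 1
    [[ "${state}" == "available" ]]
}

# Example:
#   ami_is_available ami-04f15e3b2725ea402 && echo "Safe to use in queue_mapping.yml"
```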

Prevent users from specifying a custom AMI

By default, SOCA users can use any available AMI via the -l instance_ami job parameter.

You can add instance_ami as a restricted parameter to ensure users cannot use an AMI that has not been validated by SOCA admins.

Linux or Windows Virtual Desktop Node

Refer to this page to learn how to create a SOCA Optimized AMI for your Linux or Windows Virtual Desktop.