HPC (compute) Node
The HPC node (also known as the Compute Node) is a machine deployed to run HPC jobs. HPC nodes are ephemeral unless they are provisioned behind an alwayson queue.
Bootstrap Code:¶
user_data.sh¶
Important
User Data is limited to 16 KB in size. Changing the content of this file is not recommended.
This is the default EC2 User Data, generated at instance launch via /apps/soca/CLUSTER_ID/cluster_manager/orchestrator/cloudformation_builder.py and located inside the /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/ folder.
This script only prepares the machine by installing or upgrading awscli and performing some filesystem operations.
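Because the rendered User Data must stay within the 16 KB limit, a quick size check can help before any capacity is launched. The following is a minimal sketch and is not part of SOCA: the function name `check_user_data_size` is hypothetical, and it assumes the limit is 16,384 bytes measured on the raw file (before base64 encoding).

```shell
#!/bin/bash
# Hypothetical helper (not shipped with SOCA): verify that a rendered
# User Data file fits within the EC2 limit.
# Assumption: 16 KB = 16 * 1024 bytes, measured before base64 encoding.
USER_DATA_MAX_BYTES=16384

check_user_data_size() {
  local file="$1"
  local size
  if [ ! -f "$file" ]; then
    echo "NOT FOUND: ${file}" >&2
    return 2
  fi
  # wc -c counts bytes; tr strips the padding some platforms add.
  size=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$size" -gt "$USER_DATA_MAX_BYTES" ]; then
    echo "TOO LARGE: ${file} is ${size} bytes (limit ${USER_DATA_MAX_BYTES})"
    return 1
  fi
  echo "OK: ${file} is ${size} bytes (limit ${USER_DATA_MAX_BYTES})"
}
```

Call it with the path of a rendered file, for example `check_user_data_size /path/to/rendered/user_data.sh`. A non-zero exit status means the file is missing or too large.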
setup.sh¶
This file is responsible for the main setup phase of your node. The template is located in /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node.
This file can be updated post-SOCA deployment.
setup_post_reboot.sh¶
This file is responsible for the post-reboot setup phase and is executed after the first reboot triggered by setup.sh. The template is located in /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node.
This file can be updated post-SOCA deployment.
setup_user_customization.sh¶
Use this file to add your own configuration without touching the existing node bootstrap sequence.
The template is located in /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node. You can update this file post-SOCA deployment.
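As an illustration of the kind of content that could go into the user customization script, here is a minimal sketch. Everything in it is hypothetical: the names `PROFILE_DIR`, `site_custom.sh`, `SITE_SCRATCH_DIR` and `add_site_profile` are made up for this example, and it assumes the script runs with permission to write to the target directory.

```shell
#!/bin/bash
# Hypothetical customization (illustrative only): publish a site-wide
# environment variable to every login shell on the node.
# PROFILE_DIR, site_custom.sh and SITE_SCRATCH_DIR are made-up names.
PROFILE_DIR="${PROFILE_DIR:-/etc/profile.d}"

add_site_profile() {
  local target="${PROFILE_DIR}/site_custom.sh"
  # Idempotent: do nothing if a previous run already wrote the file.
  if [ -f "$target" ]; then
    echo "already configured: ${target}"
    return 0
  fi
  printf 'export SITE_SCRATCH_DIR="%s"\n' "/scratch" > "$target" || return 1
  chmod 644 "$target"
  echo "created: ${target}"
}
```

A call to `add_site_profile` at the end of the script would apply the change. Keeping the step idempotent means it can run again without side effects.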
Bootstrap Sequence¶
sequenceDiagram
autonumber
SOCA User->>cloudformation_builder.py: Submit a new HPC job/simulation
loop
cloudformation_builder.py->>cloudformation_builder.py: Render UserData from user_data.sh.j2
cloudformation_builder.py->>cloudformation_builder.py: Render setup.sh script from setup.sh.j2
cloudformation_builder.py->>cloudformation_builder.py: Render setup_post_reboot.sh script from setup_post_reboot.sh.j2
cloudformation_builder.py->>cloudformation_builder.py: Render setup_user_customizations.sh script from setup_user_customizations.sh.j2
end
Note right of cloudformation_builder.py: Rendered templates are stored in cluster_node_bootstrap/logs/compute_node/<job_id>/
cloudformation_builder.py->>EC2 HPC Node: Launch EC2 HPC Node(s) and assign UserData
loop
EC2 HPC Node->>EC2 HPC Node: Execute UserData
EC2 HPC Node->>EC2 HPC Node: Execute setup.sh script
EC2 HPC Node->>EC2 HPC Node: Execute setup_post_reboot.sh script
EC2 HPC Node->>EC2 HPC Node: Execute setup_user_customizations.sh script
end
Info
Templates are rendered using Jinja2 and stored on the filesystem.
Note
Unlike HPC/Virtual Desktop nodes, there are no PostReboot actions for login nodes, as the actions performed via /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/setup_post_reboot.sh are specific to HPC jobs or Virtual Desktop nodes.
Bootstrap Flow¶
graph TD;
A[SOCA EC2 Node Provisioned] --> B{Architecture?};
B-- x86_64 -->C[Install x86_64 packages];
B-- aarch64 -->D[Install aarch64 packages];
C-->E[Common scripts];
D-->E;
E-->F[HPC Node Ready]
Note
HPC Node is the default Compute Node for SOCA, meaning no extra packages are installed.
Capacity Provisioning¶
Here is the complete capacity provisioning flow after a job is submitted to the scheduler queue:
graph TD;
A[Job is sent to the queue] --> B[Job Queued];
B--every minute-->C[SOCA retrieves the current queue status via **dispatcher.py**];
C-->D[SOCA determines the hardware requirements per job via **dispatcher.py**];
D-->E[SOCA launches a CloudFormation stack via **add_nodes.py**];
E-->F{Capacity is provisioned?};
F--no-->G[Wait];
G-->F;
F--yes-->H[Nodes are added to OpenPBS via **nodes_manager.py**];
H-->I{Nodes are ready?};
I--no-->J[Wait];
J-->I;
I--yes-->K[Job starts];
K-->L{Job completed?};
L--no-->M[Wait];
M-->L;
L--yes-->N[Job is removed from the queue];
N-->O[CloudFormation stack is deleted via **dispatcher.py**];
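The three "Wait" loops in the flow above all follow the same check-and-retry pattern. The sketch below shows that pattern as a generic shell helper. It is not SOCA's dispatcher logic, and the function name `wait_until` and its arguments are hypothetical.

```shell
#!/bin/bash
# Generic polling helper (illustrative; not SOCA's dispatcher logic).
# Usage: wait_until <max_attempts> <delay_seconds> <command> [args...]
# The command is always run at least once.
wait_until() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while :; do
    # A zero exit status from the check means the condition is met.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is exhausted.
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `wait_until 10 30 test -f /tmp/ready` checks for a (hypothetical) marker file up to 10 times, 30 seconds apart, and returns non-zero if the file never appears.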
Customize HPC Node Bootstrap Code¶
We recommend adding your customizations to /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/setup_user_customization.sh.j2. Alternatively, you can update /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/setup.sh.j2.
This file is common to HPC nodes, Virtual Desktop nodes and Login nodes. Use the following condition if you want your changes to apply only to HPC nodes:
{% if context.get("/job/NodeType") == "compute_node" %}
echo "This code is only executed on Compute Node (HPC)"
{% endif %}
Danger - Read me
Modifying a file under /cluster_node_bootstrap takes effect immediately and does not require any service restart.
Any error in your script may prevent SOCA from successfully provisioning capacity. If you suspect an error, check the logs mentioned below.
Always create a backup of the file before modifying it.
Do not edit 01_user_data.sh.j2 unless you can confirm the rendered file will be smaller than 16 KB after your modifications.
If you need to edit the UserData post-SOCA deployment, navigate to EC2 Console > Launch Template and create a new version of the Login Node launch template using the updated UserData.
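The advice to back up a file before modifying it can be scripted. Here is a minimal sketch, assuming a timestamped copy is sufficient. The function name `backup_file` and the `.bak.<timestamp>` suffix are hypothetical conventions, not something SOCA provides.

```shell
#!/bin/bash
# Hypothetical helper: copy a file to <name>.bak.<timestamp> before editing.
# Usage: backup_file <file> [backup_dir]
# backup_dir defaults to the directory that contains the file.
backup_file() {
  local src="$1"
  if [ ! -f "$src" ]; then
    echo "not a regular file: ${src}" >&2
    return 1
  fi
  local dir="${2:-$(dirname "$src")}"
  local dest
  dest="${dir}/$(basename "$src").bak.$(date +%Y%m%d%H%M%S)"
  # -p preserves mode and timestamps (and ownership where permitted).
  cp -p "$src" "$dest" || return 1
  # Print the backup path so the caller can record it.
  echo "$dest"
}
```

Passing a second argument stores the copy in a separate directory, which keeps backups out of the template folder.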
View HPC Node Logs¶
Depending on the operating system, the UserData logs can be found on the same host under:
/var/log/cloud-init.log
/var/log/cloud-init-output.log
/var/log/messages (this one is also copied to the location below)
All other logs (setup, post_reboot, user_customization) can be found on the shared filesystem: /apps/soca/CLUSTER_ID/cluster_node_bootstrap/logs/compute_node/<job_id>
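When a node fails to bootstrap, a quick way to narrow things down is to list which log files in the job's log folder mention an error. The helper below is a hedged sketch: the function name `find_logs_with_errors` is hypothetical, and it assumes that failures are logged with the word "error", which may not hold for every package or script.

```shell
#!/bin/bash
# Hypothetical helper: list the log files under a directory that mention
# "error" (case-insensitive). Pass the job's log folder as the argument.
find_logs_with_errors() {
  local log_dir="$1"
  if [ ! -d "$log_dir" ]; then
    echo "no such directory: ${log_dir}" >&2
    return 2
  fi
  # -r: recurse, -i: ignore case, -l: print matching file names only.
  # grep exits with status 1 when no file matches.
  grep -ril -- "error" "$log_dir"
}
```

Pass the job's log directory as the only argument. An exit status of 1 means no file matched, and 2 means the directory does not exist.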