HPC (compute) Node

The HPC node (also known as Compute Node) is a machine deployed to run HPC jobs. These nodes are ephemeral unless they are provisioned behind an alwayson queue.

Bootstrap Code:

user_data.sh

Important

User Data is limited to 16 KB in size. Changing the content of this file is not recommended.

This is the default EC2 User Data generated at instance launch via /apps/soca/CLUSTER_ID/cluster_manager/orchestrator/cloudformation_builder.py and is located inside the /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/ folder. This script only prepares the machine by installing/upgrading awscli and performing some filesystem operations.
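
As a rough illustration of this prepare phase, the sketch below mirrors the two actions described above (ensuring awscli is available and preparing filesystem locations). All paths are stand-ins, not the actual contents of user_data.sh:

```shell
# Illustrative sketch only -- not the real user_data.sh.
# MOUNT_ROOT stands in for the real mount points so the sketch
# can run anywhere without root.
MOUNT_ROOT="${MOUNT_ROOT:-$(mktemp -d)}"
mkdir -p "$MOUNT_ROOT/apps" "$MOUNT_ROOT/data"

# The real script installs/upgrades awscli; here we only check for it.
if command -v aws >/dev/null 2>&1; then
  echo "awscli found at $(command -v aws)"
else
  echo "awscli not found: the real UserData would install/upgrade it here"
fi
echo "mount points prepared under ${MOUNT_ROOT}"
```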

setup.sh

This file is responsible for the main setup phase of your node. The template is located in /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node. This file can be updated post-SOCA deployment.

setup_post_reboot.sh

This file handles the rest of the setup phase and is executed after the first reboot triggered by setup.sh. The template is located in /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node. This file can be updated post-SOCA deployment.

setup_user_customization.sh

This file is the perfect place to add your own configuration without touching the existing node bootstrap sequence. The template is located in /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node. You can update this file post-SOCA deployment.
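
A hypothetical customization you might add here is a site-wide environment file for HPC jobs. PROFILE_DIR defaults to a temporary directory so the sketch runs anywhere; on a real node you would target /etc/profile.d, and the exported values are illustrative:

```shell
# Hypothetical content for setup_user_customization.sh: install a
# site-wide environment file. PROFILE_DIR is a stand-in for /etc/profile.d.
PROFILE_DIR="${PROFILE_DIR:-$(mktemp -d)}"
cat > "${PROFILE_DIR}/soca_site.sh" <<'EOF'
# site-specific environment for HPC jobs (illustrative values)
export SCRATCH_DIR=/fsx/scratch
export OMP_NUM_THREADS=1
EOF
chmod 0644 "${PROFILE_DIR}/soca_site.sh"
echo "installed ${PROFILE_DIR}/soca_site.sh"
```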

Bootstrap Sequence

sequenceDiagram
  autonumber
  SOCA User->>cloudformation_builder.py: Submit a new HPC job/simulation
  loop
     cloudformation_builder.py->>cloudformation_builder.py: Render UserData from user_data.sh.j2 
     cloudformation_builder.py->>cloudformation_builder.py: Render setup.sh script from setup.sh.j2 
     cloudformation_builder.py->>cloudformation_builder.py: Render setup_post_reboot.sh script from setup_post_reboot.sh.j2 
     cloudformation_builder.py->>cloudformation_builder.py: Render setup_user_customization.sh script from setup_user_customization.sh.j2
  end 
  Note right of cloudformation_builder.py: Rendered templates are stored on cluster_node_bootstrap/logs/compute_node/<job_id>/
  cloudformation_builder.py->>EC2 HPC Node: Launch EC2 HPC Node(s) and assign UserData
  loop
      EC2 HPC Node->>EC2 HPC Node: Execute UserData
      EC2 HPC Node->>EC2 HPC Node: Execute setup.sh script
      EC2 HPC Node->>EC2 HPC Node: Execute setup_post_reboot.sh script
      EC2 HPC Node->>EC2 HPC Node: Execute setup_user_customization.sh script
  end 

Info

Templates are rendered using Jinja2 and stored on the filesystem.

Note

Unlike HPC/Virtual Desktop nodes, there are no PostReboot actions for Login Nodes, as the actions performed via /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/setup_post_reboot.sh are specific to HPC jobs or Virtual Desktop nodes.

Bootstrap Flow

  graph TD;
      A[SOCA EC2 Node Provisioned] --> B{Architecture?};
      B-- x86_64 -->C[Install x86_64 packages];
      B-- aarch64 -->D[Install aarch64 packages];
      C-->E[Common scripts];
      D-->E;
      E-->F[HPC Node Ready]  

Note

HPC Node is the default Compute Node for SOCA, meaning no extra packages are installed.

Capacity Provisioning

Here is the entire capacity provisioning flow after a job is submitted to the scheduler queue:

  graph TD;
      A[Job is sent to the queue] --> B[Job Queued];
      B--every minute-->C[SOCA retrieves the current queue status via **dispatcher.py**];
      C-->D[SOCA determines hardware requirements per job via **dispatcher.py**];
      D-->E[SOCA launches a CloudFormation stack via **add_nodes.py**];
      E-->F{Capacity is provisioned?};
      F--no-->G[Wait];
      G-->F;
      F--yes-->H[Nodes are added to OpenPBS via **nodes_manager.py**];
      H-->I{Nodes are ready?};
      I--no-->J[Wait];
      J-->I;
      I--yes-->K[Job Start];
      K-->L{Job completed?};
      L--no-->M[Wait];
      M-->L;
      L--yes-->N[Job is removed from the queue];
      N-->O[CloudFormation Stack is deleted via **dispatcher.py**];
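
The "Wait" steps in the flow above follow a simple poll-until-ready pattern. The sketch below shows its shape; `check_capacity` is a stand-in for the real CloudFormation status query performed by dispatcher.py, not an actual SOCA function:

```shell
# Poll-until-ready sketch (illustrative): retry until the capacity
# check succeeds. check_capacity is a stand-in that reports "ready"
# on the 3rd poll so the sketch terminates quickly.
attempt=0
check_capacity() { [ "$attempt" -ge 3 ]; }
until check_capacity; do
  attempt=$((attempt + 1))
  sleep 0.1   # the real scheduler cycle waits about a minute between polls
done
echo "capacity provisioned after ${attempt} polls"
```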

Customize HPC Node Bootstrap Code

We recommend adding your customizations to /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/setup_user_customization.sh.j2. Alternatively, you can update /apps/soca/CLUSTER_ID/cluster_node_bootstrap/compute_node/setup.sh.j2.

This file is common to HPC nodes, Virtual Desktop nodes, and Login Nodes. Use the following condition to limit your changes to HPC Nodes only:

{% if context.get("/job/NodeType") == "compute_node" %}
  echo "This code is only executed on Compute Node (HPC)"
{% endif %}

Danger - Read me

Modifying a file under /cluster_node_bootstrap takes effect immediately and does not require any service restart.

Any error in your script may prevent SOCA from successfully provisioning capacity. If you suspect an error, check the logs mentioned below.

Always create a backup of the file before modifying it.

Do not edit 01_user_data.sh.j2 unless you can confirm the rendered file will be smaller than 16 KB after your modifications.

If you need to edit the UserData post-SOCA deployment, navigate to EC2 Console > Launch Template and create a new version of the Login Node launch template using the updated UserData.
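
The sketch below combines two of the precautions above: back up a template before editing it, and verify that a rendered file stays under the 16 KB (16384 bytes) EC2 UserData limit. TEMPLATE defaults to a generated stand-in file so the sketch runs anywhere; on a real cluster, point it at the template you are editing under cluster_node_bootstrap/compute_node:

```shell
# Back up a template, then check its size against the 16 KB UserData limit.
# TEMPLATE is a stand-in; replace with the real .j2 path on your cluster.
TEMPLATE="${TEMPLATE:-$(mktemp)}"
BACKUP="${TEMPLATE}.$(date +%Y%m%d%H%M%S).bak"
cp -p "$TEMPLATE" "$BACKUP"

SIZE=$(wc -c < "$TEMPLATE")
if [ "$SIZE" -le 16384 ]; then
  echo "OK: ${SIZE} bytes (backup at ${BACKUP})"
else
  echo "WARNING: file is ${SIZE} bytes, over the 16 KB limit" >&2
fi
```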

View HPC Node Logs

Depending on the operating system, the UserData log can be found on the host itself under:

  • /var/log/cloud-init.log
  • /var/log/cloud-init-output.log
  • /var/log/messages (this one is also copied to the location below)

All other logs (setup, post_reboot, user_customization) can be found on the shared filesystem: /apps/soca/CLUSTER_ID/cluster_node_bootstrap/logs/compute_node/<job_id>
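
To inspect the most recent job's bootstrap logs, a sketch like the following can help. LOG_ROOT defaults to a generated stand-in (seeded with one fake job directory) so the sketch runs anywhere; on a real cluster use /apps/soca/CLUSTER_ID/cluster_node_bootstrap/logs/compute_node:

```shell
# Show the bootstrap logs of the most recently created job directory.
# LOG_ROOT is a stand-in; on a real cluster point it at the shared
# filesystem path above.
LOG_ROOT="${LOG_ROOT:-$(mktemp -d)}"
# Seed a fake job directory so the sketch works outside a cluster.
[ "$(ls -A "$LOG_ROOT")" ] || { mkdir -p "$LOG_ROOT/1234"; echo "setup done" > "$LOG_ROOT/1234/setup.log"; }

LATEST=$(ls -1t "$LOG_ROOT" | head -n 1)
echo "latest job id: ${LATEST}"
tail -n 20 "$LOG_ROOT/$LATEST"/*.log
```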