
Enable Slurm epilog Script

Slurm epilog scripts run automatically after a job completes on a cluster. Implementing them allows users and administrators to automate essential post-job tasks, such as resource cleanup, logging and monitoring, notifications, and data management:

  • Resource Cleanup: Automatically clean up allocated resources, terminate orphaned processes, and reset system states.
  • Logging and Monitoring: Capture job-related information, performance metrics, and potential issues for auditing, analysis, and troubleshooting.
  • Resource Accounting: Update resource usage databases for accurate tracking, billing, or quota management.
  • Notifications: Send automated notifications to users or administrators about job statuses, errors, or other relevant events.
  • Data Management: Archive, transfer, or process data generated by jobs, ensuring efficient data handling and storage.
  • Security: Enforce security policies and scan for unauthorized access or changes during job execution.

The following steps add a sample (and benign!) epilog script to your HyperPod cluster, which will run on a compute node at the completion of any Slurm job.

  1. Create a directory on your shared file system to house your epilog script and epilog log files.
info

In this example, we assume a shared file system mounted at /fsx. If your shared file system uses a different path, substitute it into the examples below.

On the controller (head) node:

# assume root privileges
sudo su

# cd into root FSxL directory, /fsx
cd /fsx

# confirm you are on the shared file system home directory
pwd
# should show /fsx

# make a directory to house epilog artifacts
mkdir epilog

# move into the newly created epilog directory
cd epilog

# confirm you are in the epilog directory
pwd
# should show /fsx/epilog
  2. The following example epilog script will echo a timestamp, list the node name, capture the top 10 running processes, and list any active user sessions. The epilog script writes its logs to a logs/ subdirectory under the shared file system directory created in step 1; in this case, we keep our logs alongside the epilog script itself.
sudo bash -c 'cat > /fsx/epilog/epilog-script.sh <<EOF
#!/bin/bash

# Get the node name (hostname)
node_name=\$(hostname)

# Get the current timestamp
timestamp=\$(date)

# Get the list of top 10 running processes
top_processes=\$(ps -e --sort=-%mem | head -n 11)

# Get the Slurm job ID and job name
job_id=\${SLURM_JOB_ID:-"undefined_job_id"}
job_name=\${SLURM_JOB_NAME:-"undefined_job_name"}

# Get the list of active user sessions
user_sessions=\$(who)

# Define the log directory and create it if it doesn'\''t exist
log_dir="/fsx/epilog/logs"
mkdir -p "\$log_dir"

# Name the log file using the Slurm job name and job ID
logfile="\$log_dir/epilog_\${job_name}_\${job_id}.log"

# Log the information to the file
{
echo "Node Name: \$node_name"
echo "Timestamp: \$timestamp"
echo "Slurm Job ID: \$job_id"
echo "Slurm Job Name: \$job_name"
echo "Top 10 Running Processes:"
echo "\$top_processes"
echo ""
echo "Active User Sessions:"
echo "\$user_sessions"
} >> "\$logfile" 2>&1
EOF'
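Before wiring the script into Slurm, you can sanity-check it by invoking it directly. Slurm normally sets SLURM_JOB_ID and SLURM_JOB_NAME in the epilog's environment, so the sketch below exports stand-in values (the job ID 12345 and name manual-test are placeholders for this manual run, not values from the original steps):

```shell
# simulate the environment Slurm provides, then run the script directly
# (the job ID and name below are placeholder values for this manual test)
sudo env SLURM_JOB_ID=12345 SLURM_JOB_NAME=manual-test \
    bash /fsx/epilog/epilog-script.sh

# confirm the log file was written to the shared log directory
cat /fsx/epilog/logs/epilog_manual-test_12345.log
```

If the script has a bug, catching it here is much easier than debugging it through Slurm, where epilog failures can drain nodes.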

  3. Add execute permissions on epilog-script.sh and modify ownership and permissions so the slurm user can write to the log directory /fsx/epilog/logs:
sudo chmod +x /fsx/epilog/epilog-script.sh

sudo chown -R slurm:slurm /fsx/epilog
sudo chmod -R 755 /fsx/epilog
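You can confirm the ownership and mode bits took effect with ls (note that the logs/ subdirectory will not exist until the script first runs and creates it with mkdir -p):

```shell
# verify ownership and permissions on the epilog directory and script
ls -ld /fsx/epilog /fsx/epilog/epilog-script.sh
# both entries should be owned by slurm:slurm with mode 755, e.g.:
#   drwxr-xr-x ... slurm slurm ... /fsx/epilog
#   -rwxr-xr-x ... slurm slurm ... /fsx/epilog/epilog-script.sh
```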

  4. With the epilog script written to /fsx/epilog/epilog-script.sh, the next step is to modify slurm.conf by adding a line that specifies the path to your epilog script under the # Slurmctld settings section. Let's start by grepping our slurm.conf file to see where we will add the reference path for Slurm to execute the epilog script created in step 2.
grep -A 3 "# Slurmctld settings" /opt/slurm/etc/slurm.conf

Now let's add a line in slurm.conf to tell Slurm where to find our epilog script:

sudo sed -i '/# Slurmctld settings/a Epilog=/fsx/epilog/epilog-script.sh' /opt/slurm/etc/slurm.conf
info

Optionally, as an alternative to the standard Epilog, which runs on each compute node allocated to the job, you can specify an EpilogSlurmctld, which runs only on the controller node (where slurmctld runs): sudo sed -i '/# Slurmctld settings/a EpilogSlurmctld=/fsx/epilog/epilog-script.sh' /opt/slurm/etc/slurm.conf

Once added, we can verify the line has been added with:

grep -i "epilog" /opt/slurm/etc/slurm.conf

# should show "Epilog=/fsx/epilog/epilog-script.sh"
  5. With the epilog defined in slurm.conf, let's apply the new configuration to the cluster.
info

When we make modifications to slurm.conf, we must restart slurmctld (the Slurm controller daemon) to apply the configuration changes. This process should not affect any running jobs if you are just adding an epilog script to slurm.conf. You will, however, see a brief downtime (1-2 minutes) when running Slurm commands like sinfo that communicate with the Slurm controller daemon. It is best practice to notify those using the cluster before restarting slurmctld.

# restart slurm ctld (takes 1-2 minutes)
sudo systemctl restart slurmctld

info

After restarting slurmctld, it is normal to observe temporary downtime when running Slurm commands. Don't worry, your running jobs won't be affected. This is normal and expected behavior when restarting slurmctld; wait another minute or so and try again.
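If you'd rather not retry by hand, a small polling loop (a convenience sketch, not part of the original steps) waits until the controller answers again:

```shell
# poll sinfo until the controller responds again
# (typically takes 1-2 minutes after a slurmctld restart)
until sinfo > /dev/null 2>&1; do
    echo "waiting for slurmctld to respond..."
    sleep 10
done
echo "slurmctld is responding"
```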

Once sinfo is responding, it is safe to run scontrol reconfigure to propagate the changes to the cluster nodes (this process will also take 1-2 minutes):

# pass the updated configuration to the cluster nodes (takes 1-2 minutes)
sudo scontrol reconfigure
# note: Slurm commands may be temporarily unavailable after executing this
# command; they should return after 1-2 minutes
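Once the cluster is reconfigured, you can also read the setting back from the running daemon rather than from the file, since scontrol show config prints the active configuration:

```shell
# query the active configuration held by slurmctld
scontrol show config | grep -i epilog
# expect a line similar to: Epilog = /fsx/epilog/epilog-script.sh
```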
  6. Congratulations! After following these steps, you can verify your epilog script is configured with Slurm by running a job. Upon job completion, you can view the output of the epilog script in the log directory /fsx/epilog/logs. You can now modify and adapt the epilog script to accomplish more useful tasks than just logging!
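For example, a minimal end-to-end check (assuming your cluster has an idle partition and the default partition is usable) is to submit a trivial job and then read the newest epilog log:

```shell
# submit a one-line test job; the epilog fires when it completes
sbatch --job-name=epilog-test --wrap="hostname"

# after the job finishes, list the epilog logs, newest first
ls -lt /fsx/epilog/logs/

# view the most recently written epilog log
cat "/fsx/epilog/logs/$(ls -t /fsx/epilog/logs/ | head -n 1)"
```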