Debug why your jobs are not starting
Jobs in dynamic queue¶
First of all, unless you submit a job to the "alwayson" queue, it will usually take 5 to 10 minutes before your job can start, as Engineering Development Hub needs to provision your capacity. This delay varies with the type and number of EC2 instances requested for your job. We recommend provisioning EDH Optimized AMIs to reduce this cold-start time.
Verify Queue log¶
If your job is not starting, first verify the queue log under /opt/edh/<EDH_CLUSTER_ID>/cluster_manager/orchestrator/logs/<SCHEDULER_ID>/queues/<queue_name>.log
tree /opt/soca/edh-demo/cluster_manager/orchestrator/logs/openpbs-default/queues/
/opt/soca/edh-demo/cluster_manager/orchestrator/logs/openpbs-default/queues/
├── high.log
├── high.log.2026-03-24
├── high.log.2026-03-31
├── job-shared.log
├── job-shared.log.2026-03-24
├── job-shared.log.2026-03-31
├── low.log
├── low.log.2026-03-24
├── low.log.2026-03-31
├── normal.log
├── normal.log.2026-03-24
├── normal.log.2026-03-31
├── test.log
├── test.log.2026-03-24
└── test.log.2026-03-31
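The log path follows a fixed pattern, so you can assemble it from your cluster, scheduler, and queue names. A sketch using the demo values from the listing above (substitute your own IDs):

```shell
# Assemble the queue log path. EDH_CLUSTER_ID, SCHEDULER_ID and QUEUE are the
# demo values from the listing above -- replace them with your own.
EDH_CLUSTER_ID=edh-demo
SCHEDULER_ID=openpbs-default
QUEUE=normal
QUEUE_LOG="/opt/soca/${EDH_CLUSTER_ID}/cluster_manager/orchestrator/logs/${SCHEDULER_ID}/queues/${QUEUE}.log"
echo "$QUEUE_LOG"
# On the scheduler host, watch it while submitting a test job:
#   tail -n 100 "$QUEUE_LOG"
```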
If the log is not created, or you don't see any updates in it even though you submitted a job, try running the dispatcher command manually. On the scheduler, list all crontabs as root with crontab -l and refer to the "Automatic Host Provisioning" section:
# Automatic Host Provisioning
* * * * * /opt/edh/edh-demo/cluster_manager/orchestrator/jobs_dispatcher.sh compute
* * * * * /opt/edh/edh-demo/cluster_manager/orchestrator/jobs_dispatcher.sh job-shared
* * * * * /opt/edh/edh-demo/cluster_manager/orchestrator/jobs_dispatcher.sh test
Run the command manually and look for any errors. Common errors include malformed YAML files.
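The crontab entry above can be replayed by hand to surface errors immediately. A sketch, reusing the demo cluster ID from the crontab (substitute your own):

```shell
# Build the exact command the crontab runs every minute; EDH_CLUSTER_ID is
# the demo value from the crontab above -- substitute your own.
EDH_CLUSTER_ID=edh-demo
DISPATCHER="/opt/edh/${EDH_CLUSTER_ID}/cluster_manager/orchestrator/jobs_dispatcher.sh"
echo "$DISPATCHER compute"
# On the scheduler, run it as root and watch the output for tracebacks
# or YAML parse errors:
#   sudo "$DISPATCHER" compute
```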
Verify Job log¶
Additionally, you can get per-job detailed logs via /opt/edh/<EDH_CLUSTER_ID>/cluster_manager/orchestrator/logs/<SCHEDULER_ID>/jobs/<job_id>.log
tree /opt/soca/$EDH_CLUSTER_ID/cluster_manager/orchestrator/logs/openpbs-default/jobs/
├── 10.log
├── 11.log
├── 12.log
├── 13.log
├── 14.log
├── 15.log
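As with the queue logs, the per-job log path can be assembled directly. A sketch with hypothetical cluster and job IDs:

```shell
# Path to the detailed log of one job; EDH_CLUSTER_ID and JOB_ID are
# hypothetical values -- substitute your own.
EDH_CLUSTER_ID=edh-demo
JOB_ID=15
JOB_LOG="/opt/soca/${EDH_CLUSTER_ID}/cluster_manager/orchestrator/logs/openpbs-default/jobs/${JOB_ID}.log"
echo "$JOB_LOG"
# On the scheduler host:
#   tail -n 100 "$JOB_LOG"
```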
Verify the job resource¶
This guide assumes you have created your queue correctly.
Run qstat -f <job_id> | grep -i resource and look for the compute_node or stack_id resources. When your job is first submitted, these resources do not exist. The dispatcher script, invoked from the crontab, will create these resources automatically.
Example of a job with all resources configured correctly:
# Job with Engineering Development Hub resources
bash-4.2$ qstat -f 2 | grep -i resource
Resource_List.instance_type = m5.large
Resource_List.ncpus = 3
Resource_List.nodect = 3
Resource_List.nodes = 3
Resource_List.place = scatter
Resource_List.select = 3:ncpus=1:compute_node=job2
Resource_List.stack_id = soca-fpgaami-job-2
Please note these resources are created by the dispatcher, so allow up to 3 minutes between the job being submitted and the resources becoming visible in the qstat output.
# Job without Engineering Development Hub resources created yet
bash-4.2$ qstat -f 2 | grep -i resource
Resource_List.instance_type = m5.large
Resource_List.ncpus = 3
Resource_List.nodect = 3
Resource_List.nodes = 3
Resource_List.place = scatter
Resource_List.select = 3:ncpus=1
If you see a compute_node value other than tbd, as well as a stack_id, Engineering Development Hub has triggered capacity provisioning by creating a new CloudFormation stack.
If you go to your CloudFormation console, you should see a new stack being created using the following naming convention: soca-<cluster_name>-job-<job_id>
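The stack name can be derived from that convention and checked from the CLI instead of the console. A sketch, using hypothetical cluster name and job ID values:

```shell
# Derive the stack name from the convention soca-<cluster_name>-job-<job_id>.
# CLUSTER_NAME and JOB_ID are hypothetical -- substitute your own.
CLUSTER_NAME=edh-demo
JOB_ID=2
STACK_NAME="soca-${CLUSTER_NAME}-job-${JOB_ID}"
echo "$STACK_NAME"
# Check its status without opening the console:
#   aws cloudformation describe-stacks --stack-name "$STACK_NAME" \
#     --query "Stacks[0].StackStatus" --output text
```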
Verify Node log¶
On the controller host, access /apps/edh/<EDH_CLUSTER_ID>/shared/logs/bootstrap/compute_node/. This folder contains the bootstrap logs of every host provisioned by EDH.
# Retrieve logs for the most recent jobs (last 2 weeks)
ls -ltr /apps/edh/<EDH_CLUSTER_ID>/shared/logs/bootstrap/compute_node/ | tail -n 5
drwxr-xr-x. 3 root root 6144 Mar 31 14:26 12
drwxr-xr-x. 3 root root 6144 Apr 1 10:22 13
drwxr-xr-x. 3 root root 6144 Apr 1 13:06 14
drwxr-xr-x. 4 root root 6144 Apr 1 14:59 15
drwxr-xr-x. 4 root root 6144 Apr 1 15:37 16
# Filter for a specific job id. Each node provisioned for this job shows up in the directory
ls -ltr /apps/edh/<EDH_CLUSTER_ID>/shared/logs/bootstrap/compute_node/10/**/ | tail -n 5
drw-------. 2 root root 6144 Mar 31 14:20 ip-74-0-167-165
drw-------. 2 root root 6144 Mar 31 14:20 ip-74-0-177-39
# For each host, you can retrieve the install logs and do any troubleshooting
ls -ltr /apps/edh/<EDH_CLUSTER_ID>/shared/logs/bootstrap/compute_node/10/**/**
-rw-r--r--. 1 root root 118 Mar 31 14:14 bootstrap_s3_location.log
-rw-r--r--. 1 root root 1818089 Mar 31 14:17 02_setup.log
-rw-r--r--. 1 root root 5779 Mar 31 14:18 03_setup_post_reboot.log
-rw-r--r--. 1 root root 112752 Apr 3 14:10 sync_ad_users.log
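To scan every node's bootstrap logs for a job at once, a recursive grep across the per-host directories works. A sketch with hypothetical cluster and job IDs:

```shell
# Root of the bootstrap logs for one job; EDH_CLUSTER_ID and JOB_ID are
# hypothetical values -- substitute your own.
EDH_CLUSTER_ID=edh-demo
JOB_ID=10
JOB_LOG_DIR="/apps/edh/${EDH_CLUSTER_ID}/shared/logs/bootstrap/compute_node/${JOB_ID}"
echo "$JOB_LOG_DIR"
# On the controller host, list the log files that mention errors across
# all hosts of the job:
#   grep -ril "error" "$JOB_LOG_DIR"
```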
If CloudFormation stack is NOT "CREATE_COMPLETE"¶
Click on the stack name, then check the "Events" tab and look for any "CREATE_FAILED" errors.
In this example, the size of the root device is too small; this can be fixed by specifying a bigger EBS disk using -l root_size=75
If CloudFormation stack is "CREATE_COMPLETE"¶
First, make sure CloudFormation has created a new "Launch Template" for your job.
Then navigate to the AutoScaling console, select your AutoScaling group, and click "Activity". Any related EC2 errors are shown in this tab.
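The same activity history is available from the CLI. The group name below is an assumption (it mirrors the job's stack name for cluster edh-demo, job 2); copy the real group name from the console if it differs:

```shell
# Assumed Auto Scaling group name for the job; verify against the console.
ASG_NAME="soca-edh-demo-job-2"
echo "$ASG_NAME"
# Same data as the console "Activity" tab:
#   aws autoscaling describe-scaling-activities \
#     --auto-scaling-group-name "$ASG_NAME" --max-records 5
```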
Here is an example of capacity being provisioned correctly
Here is an example of capacity provisioning errors:
If capacity is being provisioned correctly, go back to Engineering Development Hub and run pbsnodes -a. Verify the capacity assigned to your job ID (see resources_available.compute_node) is in state = free.
pbsnodes -a
ip-60-0-174-166
Mom = ip-60-0-174-166.us-west-2.compute.internal
Port = 15002
pbs_version = 18.1.4
ntype = PBS
state = free
pcpus = 1
resources_available.arch = linux
resources_available.availability_zone = us-west-2c
resources_available.compute_node = job2
resources_available.host = ip-60-0-174-166
resources_available.instance_type = m5.large
resources_available.mem = 7706180kb
resources_available.ncpus = 1
resources_available.subnet_id = subnet-0af93e96ed9c4377d
resources_available.vnode = ip-60-0-174-166
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = normal
resv_enable = True
sharing = default_shared
last_state_change_time = Sat Oct 12 17:37:28 2019
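When many nodes are up, a grep makes the relevant fields easier to spot. SAMPLE below reuses lines from the pbsnodes output above so the filter can be tried anywhere; on the scheduler, feed the real output instead (pbsnodes -a | grep -E "state = |compute_node"):

```shell
# Pull just the state and compute_node lines for each node.
SAMPLE='ip-60-0-174-166
     state = free
     resources_available.compute_node = job2'
FILTERED=$(printf '%s\n' "$SAMPLE" | grep -E "state = |compute_node")
printf '%s\n' "$FILTERED"
```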
If the host is not in state = free after 10 minutes, SSH to the host, become root with sudo, and check the log file located under /root as well as the cloud-init entries in /var/log/messages (grep cloud-init /var/log/messages).
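On the node itself, the standard cloud-init log locations (not EDH-specific) are usually the fastest way to find a bootstrap failure:

```shell
# Log files worth checking on the stuck node; run the commented commands
# as root on the node itself. These are standard cloud-init locations.
NODE_LOGS="/var/log/messages /var/log/cloud-init.log /var/log/cloud-init-output.log"
echo "$NODE_LOGS"
#   grep cloud-init /var/log/messages | tail -n 50
#   tail -n 100 /var/log/cloud-init-output.log
```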