Enable Checkpoints for Picotron

Checkpointing in Picotron is configured through the checkpoint section in the config.json file:

{
  "checkpoint": {
    "save_dir": "<your_run_path>",
    "save_frequency": 10,
    "load_path": ""
  }
}

Checkpoints are saved in numbered subdirectories under the save_dir path.

Set Up Checkpointing

SSH into the head node of the cluster, then complete the following steps.

Navigate to the picotron directory and create a checkpoints directory:

cd ~/awsome-distributed-training/3.test_cases/pytorch/picotron/SmolLM-1.7B/slurm
mkdir -p checkpoints

Change the ownership of the config directory so that the ubuntu user can overwrite config.json:

sudo chown -R ubuntu:ubuntu conf/llama-1B-tp2/

Use jq to point save_dir at the new checkpoints directory, clear load_path, and set save_frequency to 100:

jq --arg pwd "${PWD}" \
  '.checkpoint.save_dir = $pwd + "/checkpoints" | .checkpoint.load_path = "" | .checkpoint.save_frequency = 100' \
  conf/llama-1B-tp2/config.json > tmp.json && mv tmp.json conf/llama-1B-tp2/config.json

Verify the updates:

cat conf/llama-1B-tp2/config.json
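
To check only the fields changed above instead of reading the whole file, a filter along the following lines may help. This is a minimal sketch, not part of Picotron: the function name check_checkpoint_config is hypothetical, and it assumes jq is installed and that the config contains the checkpoint section shown earlier.

```shell
# Hypothetical helper (not part of Picotron): print the checkpoint section
# and fail if save_dir is empty or save_frequency is not a positive number.
check_checkpoint_config() {
  config_file="$1"

  # Show only the checkpoint section of the config.
  jq '.checkpoint' "$config_file" || return 1

  # -e makes jq exit non-zero when the expression evaluates to false or null.
  jq -e '(.checkpoint.save_dir | type == "string" and length > 0)
         and (.checkpoint.save_frequency | type == "number" and . > 0)' \
    "$config_file" > /dev/null
}
```

From the slurm directory used above, it could be called as check_checkpoint_config conf/llama-1B-tp2/config.json && echo "checkpoint config OK".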

Note

load_path is left empty for the first run so that the model starts from freshly initialized weights. After the first checkpoint is saved, you can set load_path to the checkpoint directory so that later runs resume from it automatically.
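
Once at least one checkpoint exists, updating load_path could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names latest_checkpoint and set_load_path are hypothetical, the checkpoint subdirectories are assumed to have purely numeric names (the text above only says they are numbered), and whether Picotron expects load_path to name a specific numbered subdirectory or save_dir itself is not confirmed here, so check that against your Picotron version.

```shell
# Hypothetical helper: print the path of the highest-numbered subdirectory
# of the given directory. ASSUMPTION: subdirectory names are purely numeric.
latest_checkpoint() {
  save_dir="$1"
  latest=$(find "$save_dir" -mindepth 1 -maxdepth 1 -type d 2>/dev/null \
    | sed 's|.*/||' | grep -E '^[0-9]+$' | sort -n | tail -n 1)
  # Fail if no numeric subdirectory was found.
  [ -n "$latest" ] || return 1
  printf '%s/%s\n' "$save_dir" "$latest"
}

# Hypothetical helper: write a new load_path into the config with jq,
# using the same write-to-temp-file-then-move pattern as the command above.
set_load_path() {
  config_file="$1"
  load_path="$2"
  jq --arg p "$load_path" '.checkpoint.load_path = $p' "$config_file" \
    > "$config_file.tmp" && mv "$config_file.tmp" "$config_file"
}
```

From the slurm directory, usage might look like ckpt=$(latest_checkpoint "${PWD}/checkpoints") && set_load_path conf/llama-1B-tp2/config.json "$ckpt". Sorting numerically rather than lexically matters here because a lexical sort would place 1000 before 200.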
