Downloading the Llama3-70b model
In this section, we will download the Llama3-70b model and its tokenizer. We will then prepare the model for the Neuron runtime by pre-sharding the model weights according to the parallel processing configuration (i.e., the degrees of the model parallelism axes).
Download the Llama3-70b model and tokenizer
First, make sure that you have a Hugging Face account with a valid User Access Token. Also, since the Llama3 family of models is hosted in gated repositories on Hugging Face, make sure that your Hugging Face account has been granted access to the Meta-Llama-3-70B model repository.
On your head node, run
huggingface-cli login
You will be prompted to enter your token. Paste in the token and answer n
when prompted to add the token as a git credential.
_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /fsx/ubuntu/.cache/huggingface/token
Login successful
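If you prefer a non-interactive login (for example, from a setup script), huggingface-cli can also take the token directly on the command line; here we assume the token is exported in an HF_TOKEN environment variable:
# assumes HF_TOKEN holds your User Access Token
huggingface-cli login --token $HF_TOKEN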
Now that you're logged in, let's grab the model weights (you may also use git clone if you prefer):
huggingface-cli download meta-llama/Meta-Llama-3-70B --local-dir /fsx/ubuntu/Meta-Llama-3-70B
Once the download completes (~30 minutes), you will see the following directory structure:
/fsx/ubuntu/Meta-Llama-3-70B/
├── LICENSE
├── README.md
├── USE_POLICY.md
├── config.json
├── generation_config.json
├── model-00001-of-00030.safetensors
...
├── model-00030-of-00030.safetensors
├── model.safetensors.index.json
├── original
│ ├── consolidated.00.pth
│   ...
│ ├── consolidated.07.pth
│ ├── params.json
│ └── tokenizer.model
├── special_tokens_map.json
├── tokenizer.json
└── tokenizer_config.json
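As a quick sanity check, you can confirm that all 30 safetensors shards listed above were downloaded:
ls /fsx/ubuntu/Meta-Llama-3-70B/model-*-of-00030.safetensors | wc -l
# Output should be 30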
Copy the tokenizer files over to the test case repository:
cp /fsx/ubuntu/Meta-Llama-3-70B/*token* /fsx/ubuntu/llama
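You can verify that the tokenizer files were copied (the /fsx/ubuntu/llama path is the test case directory used above):
ls /fsx/ubuntu/llama/*token*
# Should list special_tokens_map.json, tokenizer.json, and tokenizer_config.json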
Convert the Llama3 model weights
As mentioned, NxD requires that the model checkpoints be pre-sharded based on the chosen parallel configuration (tensor and pipeline parallelism degrees). The preprocessing entails:
- Saving the original checkpoint into a single binary file.
- Sharding that binary file using the provided convert_checkpoints.py utility script.
First, let's save the original checkpoints into a single binary file.
cat > save-llama3-70B-model.py << EOF
from transformers import AutoModelForCausalLM
import torch

# Load the Hugging Face checkpoint and save its state dict as a single binary file
model = AutoModelForCausalLM.from_pretrained("/fsx/ubuntu/Meta-Llama-3-70B")
torch.save(model.state_dict(), '/fsx/ubuntu/llama-3-70b.pt')
EOF
Let's then run this script with sbatch on a cluster compute node (ml.trn1.32xlarge), which has enough host memory to load the full model and run the script:
sbatch --job-name=save-checkpoints --output=logs/save-checkpoints.out \
--wrap "srun python save-llama3-70B-model.py"
Next, let's convert (i.e., shard) the checkpoint:
mkdir -p /fsx/ubuntu/llama3_70B/pretrained_weight
sbatch --job-name=convert-checkpoint --output=logs/convert-checkpoint.out \
--wrap "\
srun python convert_checkpoints.py \
--hw_backend trn1 \
--tp_size 32 --pp_size 8 --n_layers 80 \
--save_xser 1 \
--kv_size_multiplier 4 \
--qkv_linear 1 \
--fuse_qkv True \
--input_dir /fsx/ubuntu/llama-3-70b.pt \
--output_dir /fsx/ubuntu/llama3_70B/pretrained_weight \
--config /fsx/ubuntu/Meta-Llama-3-70B/config.json \
--convert_from_full_state"
You can track the progress by tailing your defined log file:
tail -f logs/convert-checkpoint.out
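You can also check the job's state with Slurm while it runs (the job name matches the --job-name we passed to sbatch):
squeue --name=convert-checkpoint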
Your logs will look like the following:
Saving to /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_00_tp_rank_00_pp_rank_00.pt
Saving to /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_00_tp_rank_00_pp_rank_01.pt
Saving to /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_00_tp_rank_00_pp_rank_02.pt
Saving to /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_00_tp_rank_00_pp_rank_03.pt
Saving to /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_00_tp_rank_00_pp_rank_04.pt
Saving to /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_00_tp_rank_00_pp_rank_05.pt
...
At the end of this process, we will end up with 32 x 8 = 256 checkpoint files, because convert_checkpoints.py shards the model along the tensor parallel (32) and pipeline parallel (8) dimensions.
As a sanity check:
ls /fsx/ubuntu/llama3_70B/pretrained_weight/model/dp_rank_*_tp_rank_*_pp_rank_*.pt | wc -l
# Output should be 256
Note: The sharding is done based on the hardware setup. In our case, we are running on a cluster of 16 x ml.trn1.32xlarge instances (SageMaker HyperPod SLURM cluster).
Each ml.trn1.32xlarge instance has 16 Trainium Neuron Chips (Neuron Devices). Each of these Neuron Chips has 2 NeuronCore-v2 cores (i.e., 2 Neuron Cores), for a total of 32 Neuron Cores per ml.trn1.32xlarge instance, and thus 512 Neuron Cores across the entire cluster.
Pipeline Parallelism: Given that we have 512 Neuron Cores and that Llama3-70b has 80 layers, we can split the layers across instances:
- First 10 layers: Instance 1
- Second 10 layers: Instance 2
- ...
- Eighth 10 layers: Instance 8
=> Pipeline Parallelism = 8 (i.e., across 8 ml.trn1.32xlarge instances)
Tensor Parallelism: Pipeline Parallelism gives each of the 8 instances a stage of 10 layers. Within each instance, we further split that stage's parameters across the instance's 32 Neuron Cores via Tensor Parallelism.
=> Tensor Parallelism = 32 (i.e., across the 32 Neuron Cores within each ml.trn1.32xlarge instance)
Data Parallelism: With Tensor Parallelism = 32 and Pipeline Parallelism = 8, one model replica occupies 32 x 8 = 256 Neuron Cores, so the 512 Neuron Cores in the cluster hold two replicas of the sharded model. We therefore employ data parallelism with a degree of 2 to speed up training. The resulting checkpoints will be used in the next continual pre-training stage.
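As a quick check on this arithmetic (assuming the 16 x ml.trn1.32xlarge cluster described above):
# 16 instances x 32 Neuron Cores = 512 cores in total
# data parallel degree = total cores / (tp_size x pp_size) = 512 / (32 x 8)
echo $(( (16 * 32) / (32 * 8) ))
# Output should be 2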