BioNeMo on EKS
Deployment of ML models on EKS requires access to GPU or AWS Neuron instances. If your deployment isn't working, it is often because the cluster cannot obtain these resources. Some deployment patterns also rely on Karpenter autoscaling and static node groups; if nodes aren't initializing, check the Karpenter controller logs or the node group events to diagnose the issue.
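If node provisioning looks stuck, the commands below are a reasonable starting point; the Karpenter namespace and label shown are the common defaults and may differ in your installation.

# Karpenter controller logs (adjust namespace/label to your install)
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# Recent cluster events, useful for spotting pods stuck in Pending or nodes failing to join
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 20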
This blueprint should be considered experimental and should only be used for proof-of-concept purposes.
Introduction
NVIDIA BioNeMo is a generative AI platform for drug discovery that simplifies and accelerates both training models on your own data and scaling model deployment for drug discovery applications. BioNeMo offers the quickest path to both AI model development and deployment, accelerating the journey to AI-powered drug discovery. It has a growing community of users and contributors and is actively maintained and developed by NVIDIA.
Given its containerized nature, BioNeMo can be deployed across a variety of environments such as Amazon SageMaker, AWS ParallelCluster, Amazon ECS, and Amazon EKS. This solution, however, focuses specifically on deploying BioNeMo on Amazon EKS.
Source: https://blogs.nvidia.com/blog/bionemo-on-aws-generative-ai-drug-discovery/
Deploying BioNeMo on Kubernetes
This blueprint relies on three major components: the NVIDIA Device Plugin exposes GPUs to the Kubernetes scheduler, Amazon FSx for Lustre stores the training data, and the Kubeflow Training Operator manages the actual training process.
In this blueprint, we will deploy an Amazon EKS cluster and execute both a data preparation job and a distributed model training job.
Pre-requisites
👈Deploy the blueprint
👈Verify Deployment
👈Run BioNeMo Training jobs
Once you've ensured that all components are functioning properly, you can proceed to submit jobs to your clusters.
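As a quick sanity check, the commands below confirm that GPUs are advertised to the scheduler and that the Kubeflow training operator is running; the kubeflow namespace is the common default and may differ in your cluster.

# Nodes should report allocatable nvidia.com/gpu resources via the NVIDIA Device Plugin
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# The training operator pod should be Running (namespace may differ)
kubectl get pods -n kubeflow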
Step 1: Initiate the UniRef50 Data Preparation Task
The first task, defined in uniref50-job.yaml, downloads and partitions the data to improve processing efficiency. It retrieves the UniRef50 dataset and organizes it within the FSx for Lustre file system in a layout designed for training, testing, and validation. You can learn more about the UniRef datasets here.
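For orientation, the sketch below shows the general shape such a data-preparation Job can take: a single pod that runs the BioNeMo preprocessing step and writes its output to an FSx for Lustre backed volume mounted at /fsx. The image, command, and PVC name are placeholders, not the blueprint's actual values; uniref50-job.yaml in the repository is the source of truth.

# Illustrative sketch only -- see uniref50-job.yaml in the blueprint for the real manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: uniref50-download
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: preprocess
          image: <bionemo-framework-image>        # placeholder container image
          command: ["/bin/bash", "-c", "<uniref50 download and preprocessing command>"]  # placeholder
          volumeMounts:
            - name: fsx
              mountPath: /fsx                     # raw and processed data land under /fsx (see logs below)
      volumes:
        - name: fsx
          persistentVolumeClaim:
            claimName: fsx-claim                  # assumed name of the FSx for Lustre backed PVC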
To execute this job, navigate to the examples/training directory and deploy the uniref50-job.yaml manifest using the following commands:
cd examples/training
kubectl apply -f uniref50-job.yaml
It's important to note that this task requires a significant amount of time, typically ranging from 50 to 60 hours.
Run the command below and look for a pod named uniref50-download-*:
kubectl get pods
To verify its progress, examine the logs generated by the corresponding pod:
kubectl logs uniref50-download-xnz42
[NeMo I 2024-02-26 23:02:20 preprocess:289] Download and preprocess of UniRef50 data does not currently use GPU. Workstation or CPU-only instance recommended.
[NeMo I 2024-02-26 23:02:20 preprocess:115] Data processing can take an hour or more depending on system resources.
[NeMo I 2024-02-26 23:02:20 preprocess:117] Downloading file from https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz...
[NeMo I 2024-02-26 23:02:20 preprocess:75] Downloading file to /fsx/raw/uniref50.fasta.gz...
[NeMo I 2024-02-26 23:08:33 preprocess:89] Extracting file to /fsx/raw/uniref50.fasta...
[NeMo I 2024-02-26 23:12:46 preprocess:311] UniRef50 data processing complete.
[NeMo I 2024-02-26 23:12:46 preprocess:313] Indexing UniRef50 dataset.
[NeMo I 2024-02-26 23:16:21 preprocess:319] Writing processed dataset files to /fsx/processed...
[NeMo I 2024-02-26 23:16:21 preprocess:255] Creating train split...
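Because the download and preprocessing run for many hours, it is worth confirming that the job actually ran to completion before moving on. Assuming the manifest defines a standard Kubernetes Job (as the generated pod name suggests), the checks below apply; the job-name label is added automatically by the Job controller.

# COMPLETIONS should read 1/1 once preprocessing has finished
kubectl get jobs

# The pod created by the Job should show a Completed status
kubectl get pods --selector=job-name=uniref50-download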
After this task finishes, the processed dataset is saved in the /fsx/processed directory. Once that is done, we can move on to the pre-training job, which is defined in the esm1nv_pretrain-job.yaml manifest.
In this PyTorchJob YAML, the command python3 -m torch.distributed.run plays a crucial role in orchestrating distributed training across multiple worker pods in your Kubernetes cluster (a trimmed sketch of such a manifest follows the list below).
It handles the following tasks:
- Initializes a distributed backend for communication between worker processes. In our example the rendezvous uses c10d, PyTorch's built-in backend, which can work over mechanisms such as TCP depending on your environment, while the GPU collectives typically run over NCCL.
- Sets up the environment variables (such as rank, world size, and master address) that enable distributed training within your training script.
- Launches your training script on all worker pods, ensuring each process participates in the distributed training.
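The sketch below illustrates the overall shape of such a PyTorchJob manifest. The replica count, GPU limit, nprocPerNode value, and /fsx mount mirror what is described in this walkthrough, while the image, command arguments, and PVC name are placeholders; the blueprint's esm1nv_pretrain-job.yaml is the source of truth.

# Illustrative sketch only -- see esm1nv_pretrain-job.yaml in the blueprint for the real manifest
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: esm1nv-pretraining
spec:
  nprocPerNode: "4"                          # processes launched per node by torch.distributed.run
  pytorchReplicaSpecs:
    Worker:
      replicas: 8                            # matches the eight worker pods shown below
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <bionemo-framework-image>                      # placeholder container image
              command: ["python3", "-m", "torch.distributed.run", "<pretraining script and flags>"]  # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1          # one GPU per worker pod
              volumeMounts:
                - name: fsx
                  mountPath: /fsx            # processed UniRef50 data from the previous step
          volumes:
            - name: fsx
              persistentVolumeClaim:
                claimName: fsx-claim         # assumed name of the FSx for Lustre backed PVC

To submit the pre-training job from the blueprint, run: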
cd examples/training
kubectl apply -f esm1nv_pretrain-job.yaml
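You can also confirm that the training operator admitted the job by listing PyTorchJob resources; the job should appear and, once its pods are scheduled, report a Running state.

kubectl get pytorchjobs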
Run the command below and look for the pods named esm1nv-pretraining-worker-*:
kubectl get pods
NAME READY STATUS RESTARTS AGE
esm1nv-pretraining-worker-0 1/1 Running 0 11m
esm1nv-pretraining-worker-1 1/1 Running 0 11m
esm1nv-pretraining-worker-2 1/1 Running 0 11m
esm1nv-pretraining-worker-3 1/1 Running 0 11m
esm1nv-pretraining-worker-4 1/1 Running 0 11m
esm1nv-pretraining-worker-5 1/1 Running 0 11m
esm1nv-pretraining-worker-6 1/1 Running 0 11m
esm1nv-pretraining-worker-7 1/1 Running 0 11m
We should see 8 pods running. The pod definition specifies 8 worker replicas, each with a limit of 1 GPU. Karpenter provisioned two g5.12xlarge instances with 4 GPUs each, and with nprocPerNode set to "4", each node ends up running four of the training workers. For more details on distributed PyTorch training, see the PyTorch docs.
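To see how those pods were spread across the two nodes, list them with their node placement. The training.kubeflow.org/job-name label is applied by the training operator, assuming the PyTorchJob is named esm1nv-pretraining as the pod names suggest.

# Four worker pods should land on each g5.12xlarge node
kubectl get pods -l training.kubeflow.org/job-name=esm1nv-pretraining -o wide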
This training job can run for at least 3-4 days with g5.12xlarge nodes.
This configuration uses Kubeflow's PyTorchJob Custom Resource Definition (CRD). Within this manifest, various parameters are available for customization. For detailed insights into each parameter and guidance on fine-tuning, refer to BioNeMo's documentation.
Based on the Kubeflow training operator documentation, if you do not specify a master replica explicitly, the first worker replica pod (worker-0) is treated as the master pod.
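If you prefer to declare the master explicitly instead of relying on that default, the replica specs can include a Master entry alongside the Workers. A minimal hedged fragment, reusing the same pod template as the workers:

# Fragment only -- the pod template is identical to the Worker template and omitted here
pytorchReplicaSpecs:
  Master:
    replicas: 1
    restartPolicy: OnFailure
    # template: same container spec as the Worker replicas
  Worker:
    replicas: 7    # e.g., 7 Workers plus 1 Master keeps 8 training pods in total

The Master replica typically runs rank 0 and participates in training, so it needs the same GPU and volume configuration as the Workers.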
To track training progress, check the logs of one of the worker pods:
kubectl logs esm1nv-pretraining-worker-0
Epoch 0: 7%|▋ | 73017/1017679 [00:38<08:12, 1918.0%
Additionally, you can get a snapshot of the GPU status on a specific worker node by running the nvidia-smi command inside a pod scheduled on that node. For more robust observability, you can use the DCGM Exporter (an install sketch follows the nvidia-smi output below).
kubectl exec esm1nv-pretraining-worker-0 -- nvidia-smi
Mon Feb 24 18:51:35 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 33C P0 112W / 300W | 3032MiB / 23028MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
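For the DCGM Exporter mentioned above, a minimal install sketch using NVIDIA's Helm chart is shown below; verify the repository URL and chart values against the dcgm-exporter project documentation before using it.

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

Once running, the exporter exposes per-GPU utilization and memory metrics in Prometheus format, which you can scrape with your existing monitoring stack.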
Benefits of Distributed Training:
By distributing the training workload across multiple GPUs in your worker pods, you can train large models faster by leveraging the combined computational power of all GPUs, and you can handle larger datasets that would not fit in a single GPU's memory.
Conclusion
BioNeMo is a powerful generative AI platform tailored for drug discovery. In this example, we pretrained a custom model entirely from scratch using the extensive UniRef50 dataset. However, BioNeMo also offers the flexibility to expedite the process by using pretrained models provided directly by NVIDIA. That alternative approach can significantly streamline your workflow while retaining the robust capabilities of the BioNeMo framework.