BioNeMo on EKS
Deployment of ML models on EKS requires access to GPU or AWS Neuron instances. If your deployment isn't working, it is often because the cluster cannot obtain these resources. Some deployment patterns also rely on Karpenter autoscaling and static node groups; if nodes aren't initializing, check the Karpenter controller logs or the node group events to diagnose the issue.
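If node provisioning looks stuck, the commands below are a reasonable starting point; the Karpenter namespace and label shown are the common defaults and may differ in your installation.

# Karpenter controller logs (adjust namespace/label to your install)
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# Recent cluster events, useful for spotting pods stuck in Pending or nodes failing to join
kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 20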
This blueprint should be considered experimental and should only be used for proof-of-concept purposes.
Introduction
NVIDIA BioNeMo is a generative AI platform for drug discovery that simplifies and accelerates both training models on your own data and scaling model deployment for drug discovery applications. BioNeMo offers the quickest path to both AI model development and deployment, accelerating the journey to AI-powered drug discovery. It has a growing community of users and contributors and is actively maintained and developed by NVIDIA.
Given its containerized nature, BioNeMo can be deployed across a variety of environments such as Amazon SageMaker, AWS ParallelCluster, Amazon ECS, and Amazon EKS. This solution, however, focuses specifically on deploying BioNeMo on Amazon EKS.
Source: https://blogs.nvidia.com/blog/bionemo-on-aws-generative-ai-drug-discovery/
Deploying BioNeMo on Kubernetes
This blueprint relies on three major components: the NVIDIA Device Plugin exposes GPUs to the Kubernetes scheduler, Amazon FSx for Lustre stores the training data, and the Kubeflow Training Operator manages the actual training process.
In this blueprint, we will deploy an Amazon EKS cluster and execute both a data preparation job and a distributed model training job.
Pre-requisites
👈Deploy the blueprint
👈Verify Deployment
👈Run BioNeMo Training jobs
Once you've ensured that all components are functioning properly, you can proceed to submit jobs to your clusters.
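As a quick sanity check, the commands below confirm that GPUs are advertised to the scheduler and that the Kubeflow training operator is running; the kubeflow namespace is the common default and may differ in your cluster.

# Nodes should report allocatable nvidia.com/gpu resources via the NVIDIA Device Plugin
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# The training operator pod should be Running (namespace may differ)
kubectl get pods -n kubeflow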
Step 1: Initiate the UniRef50 Data Preparation Task
The first task, defined in uniref50-job.yaml, downloads and partitions the data to improve processing efficiency. It retrieves the UniRef50 dataset and organizes it within the FSx for Lustre file system in a layout designed for training, testing, and validation. You can learn more about the UniRef datasets here.
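For orientation, the sketch below shows the general shape such a data-preparation Job can take: a single pod that runs the BioNeMo preprocessing step and writes its output to an FSx for Lustre backed volume mounted at /fsx. The image, command, and PVC name are placeholders, not the blueprint's actual values; uniref50-job.yaml in the repository is the source of truth.

# Illustrative sketch only -- see uniref50-job.yaml in the blueprint for the real manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: uniref50-download
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: preprocess
          image: <bionemo-framework-image>        # placeholder container image
          command: ["/bin/bash", "-c", "<uniref50 download and preprocessing command>"]  # placeholder
          volumeMounts:
            - name: fsx
              mountPath: /fsx                     # raw and processed data land under /fsx (see logs below)
      volumes:
        - name: fsx
          persistentVolumeClaim:
            claimName: fsx-claim                  # assumed name of the FSx for Lustre backed PVC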
To execute this job, navigate to the examples/training directory and deploy the uniref50-job.yaml manifest using the following commands:
cd examples/training
kubectl apply -f uniref50-job.yaml
It's important to note that this task requires a significant amount of time, typically ranging from 50 to 60 hours.
Run the command below and look for a pod named uniref50-download-*:
kubectl get pods
To verify its progress, examine the logs generated by the corresponding pod:
kubectl logs uniref50-download-xnz42
[NeMo I 2024-02-26 23:02:20 preprocess:289] Download and preprocess of UniRef50 data does not currently use GPU. Workstation or CPU-only instance recommended.
[NeMo I 2024-02-26 23:02:20 preprocess:115] Data processing can take an hour or more depending on system resources.
[NeMo I 2024-02-26 23:02:20 preprocess:117] Downloading file from https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz...
[NeMo I 2024-02-26 23:02:20 preprocess:75] Downloading file to /fsx/raw/uniref50.fasta.gz...
[NeMo I 2024-02-26 23:08:33 preprocess:89] Extracting file to /fsx/raw/uniref50.fasta...
[NeMo I 2024-02-26 23:12:46 preprocess:311] UniRef50 data processing complete.
[NeMo I 2024-02-26 23:12:46 preprocess:313] Indexing UniRef50 dataset.
[NeMo I 2024-02-26 23:16:21 preprocess:319] Writing processed dataset files to /fsx/processed...
[NeMo I 2024-02-26 23:16:21 preprocess:255] Creating train split...
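Because the download and preprocessing run for many hours, it is worth confirming that the job actually ran to completion before moving on. Assuming the manifest defines a standard Kubernetes Job (as the generated pod name suggests), the checks below apply; the job-name label is added automatically by the Job controller.

# COMPLETIONS should read 1/1 once preprocessing has finished
kubectl get jobs

# The pod created by the Job should show a Completed status
kubectl get pods --selector=job-name=uniref50-download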
After this task finishes, the processed dataset is saved in the /fsx/processed directory. Once that is done, we can move on to the pre-training job, which is defined in the esm1nv_pretrain-job.yaml manifest.
In this PyTorchJob YAML, the command python3 -m torch.distributed.run plays a crucial role in orchestrating distributed training across multiple worker pods in your Kubernetes cluster (a trimmed sketch of such a manifest follows the list below).
It handles the following tasks:
- Initializes a distributed backend for communication between worker processes. In our example the rendezvous uses c10d, PyTorch's built-in backend, which can work over mechanisms such as TCP depending on your environment, while the GPU collectives typically run over NCCL.
- Sets up the environment variables (such as rank, world size, and master address) that enable distributed training within your training script.
- Launches your training script on all worker pods, ensuring each process participates in the distributed training.
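The sketch below illustrates the overall shape of such a PyTorchJob manifest. The replica count, GPU limit, nprocPerNode value, and /fsx mount mirror what is described in this walkthrough, while the image, command arguments, and PVC name are placeholders; the blueprint's esm1nv_pretrain-job.yaml is the source of truth.

# Illustrative sketch only -- see esm1nv_pretrain-job.yaml in the blueprint for the real manifest
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: esm1nv-pretraining
spec:
  nprocPerNode: "4"                          # processes launched per node by torch.distributed.run
  pytorchReplicaSpecs:
    Worker:
      replicas: 8                            # matches the eight worker pods shown below
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <bionemo-framework-image>                      # placeholder container image
              command: ["python3", "-m", "torch.distributed.run", "<pretraining script and flags>"]  # placeholder
              resources:
                limits:
                  nvidia.com/gpu: 1          # one GPU per worker pod
              volumeMounts:
                - name: fsx
                  mountPath: /fsx            # processed UniRef50 data from the previous step
          volumes:
            - name: fsx
              persistentVolumeClaim:
                claimName: fsx-claim         # assumed name of the FSx for Lustre backed PVC

To submit the pre-training job from the blueprint, run: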
cd examples/training
kubectl apply -f esm1nv_pretrain-job.yaml
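You can also confirm that the training operator admitted the job by listing PyTorchJob resources; the job should appear and, once its pods are scheduled, report a Running state.

kubectl get pytorchjobs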
Run the command below and look for the pods named esm1nv-pretraining-worker-*:
kubectl get pods
NAME READY STATUS RESTARTS AGE
esm1nv-pretraining-worker-0 1/1 Running 0 11m
esm1nv-pretraining-worker-1 1/1 Running 0 11m
esm1nv-pretraining-worker-2 1/1 Running 0 11m
esm1nv-pretraining-worker-3 1/1 Running 0 11m
esm1nv-pretraining-worker-4 1/1 Running 0 11m
esm1nv-pretraining-worker-5 1/1 Running 0 11m
esm1nv-pretraining-worker-6 1/1 Running 0 11m
esm1nv-pretraining-worker-7 1/1 Running 0 11m
We should see 8 pods running. The pod definition specifies 8 worker replicas, each with a limit of 1 GPU. Karpenter provisioned two g5.12xlarge instances with 4 GPUs each, and with nprocPerNode set to "4", each node ends up running four of the training workers. For more details on distributed PyTorch training, see the PyTorch docs.
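To see how those pods were spread across the two nodes, list them with their node placement. The training.kubeflow.org/job-name label is applied by the training operator, assuming the PyTorchJob is named esm1nv-pretraining as the pod names suggest.

# Four worker pods should land on each g5.12xlarge node
kubectl get pods -l training.kubeflow.org/job-name=esm1nv-pretraining -o wide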
This training job can run for at least 3-4 days with g5.12xlarge nodes.
This configuration uses Kubeflow's PyTorchJob Custom Resource Definition (CRD). Within this manifest, various parameters are available for customization. For detailed insights into each parameter and guidance on fine-tuning, refer to BioNeMo's documentation.
Based on the Kubeflow training operator documentation, if you do not specify a master replica explicitly, the first worker replica pod (worker-0) is treated as the master pod.
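If you prefer to declare the master explicitly instead of relying on that default, the replica specs can include a Master entry alongside the Workers. A minimal hedged fragment, reusing the same pod template as the workers:

# Fragment only -- the pod template is identical to the Worker template and omitted here
pytorchReplicaSpecs:
  Master:
    replicas: 1
    restartPolicy: OnFailure
    # template: same container spec as the Worker replicas
  Worker:
    replicas: 7    # e.g., 7 Workers plus 1 Master keeps 8 training pods in total

The Master replica typically runs rank 0 and participates in training, so it needs the same GPU and volume configuration as the Workers.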
To track training progress, check the logs of one of the worker pods:
kubectl logs esm1nv-pretraining-worker-0
Epoch 0: 7%|▋ | 73017/1017679 [00:38<08:12, 1918.0%
Additionally, you can get a snapshot of the GPU status on a specific worker node by running the nvidia-smi command inside a pod scheduled on that node. For more robust observability, you can use the DCGM Exporter (an install sketch follows the nvidia-smi output below).
kubectl exec esm1nv-pretraining-worker-0 -- nvidia-smi
Mon Feb 24 18:51:35 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02 Driver Version: 535.230.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
| 0% 33C P0 112W / 300W | 3032MiB / 23028MiB | 95% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
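For the DCGM Exporter mentioned above, a minimal install sketch using NVIDIA's Helm chart is shown below; verify the repository URL and chart values against the dcgm-exporter project documentation before using it.

helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

Once running, the exporter exposes per-GPU utilization and memory metrics in Prometheus format, which you can scrape with your existing monitoring stack.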
Benefits of Distributed Training:
By distributing the training workload across multiple GPUs in your worker pods, you can train large models faster by leveraging the combined computational power of all GPUs, and you can handle larger datasets that would not fit in a single GPU's memory.
Conclusion
BioNeMo is a powerful generative AI platform tailored for drug discovery. In this example, we pretrained a custom model entirely from scratch using the extensive UniRef50 dataset. However, BioNeMo also offers the flexibility to expedite the process by using pretrained models provided directly by NVIDIA. That alternative approach can significantly streamline your workflow while retaining the robust capabilities of the BioNeMo framework.