AIBrix
AIBrix is an open source initiative designed to provide essential building blocks to construct scalable GenAI inference infrastructure. AIBrix delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.
Features
- LLM Gateway and Routing: Efficiently manage and direct traffic across multiple models and replicas.
- High-Density LoRA Management: Streamlined support for lightweight, low-rank adaptations of models.
- Distributed Inference: Scalable architecture to handle large workloads across multiple nodes.
- LLM App-Tailored Autoscaler: Dynamically scale inference resources based on real-time demand.
- Unified AI Runtime: A versatile sidecar enabling metric standardization, model downloading, and management.
- Heterogeneous-GPU Inference: Cost-effective SLO-driven LLM inference using heterogeneous GPUs.
- GPU Hardware Failure Detection: Proactive detection of GPU hardware issues.
Deploying the Solution
Checking AIBrix Installation
Run the following command to check the AIBrix installation:
kubectl get pods -n aibrix-system
Wait until all the pods are in the Running state.
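Optionally, instead of polling, you can block until every pod in the namespace reports ready; the 300-second timeout below is an arbitrary choice:
kubectl wait --for=condition=Ready pod --all -n aibrix-system --timeout=300s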
Running a model on AIBrix
We will now run the DeepSeek-R1-Distill-Llama-8B model using AIBrix on EKS.
Run the following command:
kubectl apply -f blueprints/inference/aibrix/deepseek-distill.yaml
This deploys the model in the deepseek-aibrix namespace. Wait a few minutes, then run:
kubectl get pods -n deepseek-aibrix
Wait for the pod to reach the Running state.
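As above, you can block until the pod is ready instead of polling; the longer timeout here is an arbitrary allowance for image pulls and model download:
kubectl wait --for=condition=Ready pod --all -n deepseek-aibrix --timeout=900s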
Accessing the model using the Gateway
The Gateway is designed to serve LLM requests and provides features such as dynamic model and LoRA adapter discovery, per-user request-count and token-usage budgeting, streaming, and advanced routing strategies such as prefix-cache-aware routing and routing across heterogeneous GPU hardware. To access the model through the Gateway, run the following command:
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
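Note: the trailing hash in the service name (903790dc above) is generated per installation, so it will likely differ in your cluster. You can look up the Envoy service created for the AIBrix gateway with:
kubectl get svc -n envoy-gateway-system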
Once the port-forward is running, you can test the model by sending a request to the Gateway.
ENDPOINT="localhost:8888"
curl -v http://${ENDPOINT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-8b",
    "prompt": "San Francisco is a",
    "max_tokens": 128,
    "temperature": 0
  }'
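The Gateway exposes an OpenAI-compatible API, so the chat completions endpoint should also work; a minimal sketch reusing the same model name (the message content is just an example):
curl http://${ENDPOINT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1-distill-llama-8b",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 128,
    "temperature": 0
  }'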
Cleanup
To avoid unwanted charges to your AWS account, delete all the AWS resources created during this deployment.
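At a minimum, stop the port-forward and remove the model resources by reversing the earlier apply; this is a minimal sketch, and the namespace delete is a safeguard in case the blueprint manifest created it:
# Stop the background port-forward started earlier (adjust the job spec if you have other background jobs)
kill %1
# Remove the model resources created from the blueprint manifest
kubectl delete -f blueprints/inference/aibrix/deepseek-distill.yaml
# Remove the namespace if it still exists
kubectl delete namespace deepseek-aibrix --ignore-not-found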