Deploying a model from S3 or FSx
You can deploy model artifacts directly from Amazon S3 or Amazon FSx to your HyperPod cluster using the InferenceEndpointConfig resource. The inference operator uses the S3 CSI driver to make the model files available to the pods in the cluster. With this configuration, the operator downloads the files located under the prefix deepseek15b, as set by the modelLocation parameter.
Prerequisite
- Clone the sagemaker-genai-hosting-examples repository and change into the sagemaker-genai-hosting-examples/SageMakerHyperPod/hyperpod-inference directory:
git clone https://github.com/aws-samples/sagemaker-genai-hosting-examples.git
cd sagemaker-genai-hosting-examples/SageMakerHyperPod/hyperpod-inference
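Before deploying, it can help to confirm that the inference operator's InferenceEndpointConfig CRD and the storage CSI drivers are installed on the cluster. This is an optional sanity check; the driver names s3.csi.aws.com and fsx.csi.aws.com are the upstream defaults and are assumptions here, so adjust them to your setup:

# Look for the InferenceEndpointConfig custom resource installed by the operator
kubectl get crd | grep -i inferenceendpointconfig

# Check that the S3 and FSx for Lustre CSI drivers are registered (names assumed)
kubectl get csidrivers | grep -E 's3\.csi\.aws\.com|fsx\.csi\.aws\.com'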
Deploy the model
- Prepare model artifacts
When deploying a model from S3, we define the prefix that the model artifacts are available under.
You can upload the DeepSeek R1 Distill Qwen 1.5B artifacts to your S3 bucket or FSx file system. Below is an example of copying them to an S3 bucket:
s3_bucket=<bucket_name>
aws s3 sync s3://jumpstart-cache-prod-us-east-2/deepseek-llm/deepseek-llm-r1-distill-qwen-1-5b/artifacts/inference-prepack/v2.0.0 s3://$s3_bucket/deepseek15b
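To confirm the artifacts landed under the expected prefix, you can list the objects using the same s3_bucket variable set above (an optional check, not part of the sample):

aws s3 ls s3://$s3_bucket/deepseek15b/ --recursive --summarize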
Alternatively, you can copy the model to FSx for Lustre by creating a Kubernetes Job with the FSx persistent volume claim (PVC) mounted:
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-model-to-fsx
spec:
  template:
    spec:
      containers:
      - name: aws-cli
        image: amazon/aws-cli:latest
        command: ["/bin/bash"]
        args:
        - -c
        - |
          aws s3 sync s3://jumpstart-cache-prod-us-east-2/deepseek-llm/deepseek-llm-r1-distill-qwen-1-5b/artifacts/inference-prepack/v2.0.0 /fsx/deepseek15b
        volumeMounts:
        - name: fsx-storage
          mountPath: /fsx
        env:
        - name: AWS_DEFAULT_REGION
          value: "us-east-1" # Replace with your region
      volumes:
      - name: fsx-storage
        persistentVolumeClaim:
          claimName: fsx-claim
      restartPolicy: Never
  backoffLimit: 3
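A minimal sketch of running and monitoring the copy Job, assuming the manifest above is saved as copy-model-to-fsx.yaml and that the fsx-claim PVC it references already exists in the namespace:

# Create the Job and wait for the sync to finish (large models can take several minutes)
kubectl apply -f copy-model-to-fsx.yaml
kubectl wait --for=condition=complete job/copy-model-to-fsx --timeout=30m

# Review the sync output, then remove the completed Job
kubectl logs job/copy-model-to-fsx
kubectl delete job copy-model-to-fsx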
- Then, in deploy_S3_inference_operator.yaml or deploy_fsx_lustre_inference_operator.yaml, configure your S3 bucket name or FSx file system ID as s3Storage:bucketName or fsxStorage:fileSystemId, respectively.
spec:
  modelName: deepseek15b
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: <bucket name>
    modelLocation: deepseek15b # prefix under the bucket where the artifacts were synced
Or, for FSx for Lustre, use:
spec:
  endpointName: deepseek15b
  instanceType: ml.g5.8xlarge
  invocationEndpoint: invocations
  modelName: deepseek15b
  modelSourceConfig:
    fsxStorage:
      fileSystemId: <fs-ID1234abcd>
    modelLocation: deepseek15b # path under the FSx mount where the artifacts were copied
    modelSourceType: fsx
- Then we apply it to our cluster
kubectl apply -f deploy_S3_inference_operator.yaml
Or
kubectl apply -f deploy_fsx_lustre_inference_operator.yaml
- Check the status of your deployment
kubectl describe InferenceEndpointConfig deepseek15b -n default
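In addition, you can list the resources in the namespace and poll the SageMaker endpoint that the operator creates (the endpoint name matches endpointName in the spec). These are optional checks, not part of the sample files:

# List the InferenceEndpointConfig resources and the pods created for the deployment
kubectl get InferenceEndpointConfig -n default
kubectl get pods -n default

# Check the status of the SageMaker endpoint created by the operator
aws sagemaker describe-endpoint --endpoint-name deepseek15b --query 'EndpointStatus'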
Invoking the model
Once the model is deployed, you can invoke the SageMaker endpoint that the operator creates by calling the InvokeEndpoint API. The endpoint name is set by endpointName in the YAML file; here it is "deepseek15b".
For example, using boto3, this would be:
import boto3
import json

client = boto3.client('sagemaker-runtime')

response = client.invoke_endpoint(
    EndpointName='deepseek15b',
    ContentType='application/json',
    Accept='application/json',
    Body=json.dumps({
        "inputs": "Hi, what can you help me with?"
    })
)

print(response['Body'].read().decode('utf-8'))
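The endpoint can also be invoked from the AWS CLI; a quick alternative sketch (with AWS CLI v2, --cli-binary-format raw-in-base64-out lets you pass the JSON body as plain text):

aws sagemaker-runtime invoke-endpoint \
    --endpoint-name deepseek15b \
    --content-type application/json \
    --cli-binary-format raw-in-base64-out \
    --body '{"inputs": "Hi, what can you help me with?"}' \
    response.json

cat response.json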