
LLM Serving

Foundation models are commonly deployed to Amazon SageMaker AI managed inference inside a serving container. These containers handle HTTP requests out of the box, but some need a reverse proxy in front of them to satisfy the HTTP contract of a SageMaker real-time endpoint (health checks on `GET /ping` and inference traffic on `POST /invocations`, served on port 8080). To date, the following LLM servers are supported:
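As a sketch of the translation such a proxy performs, the mapping below pairs SageMaker's required routes with those of an OpenAI-compatible server such as vLLM (which exposes `/health` and `/v1/chat/completions`); the upstream paths are an assumption to adjust per framework:

```python
# Hypothetical sketch of the route translation a reverse proxy performs so a
# SageMaker real-time endpoint (GET /ping, POST /invocations on port 8080)
# can front an OpenAI-compatible server. Upstream paths assume vLLM's
# OpenAI-compatible server; other frameworks may differ.
ROUTE_MAP = {
    ("GET", "/ping"): ("GET", "/health"),
    ("POST", "/invocations"): ("POST", "/v1/chat/completions"),
}

def translate(method: str, path: str) -> tuple[str, str]:
    """Return the upstream (method, path) for a SageMaker-side request."""
    try:
        return ROUTE_MAP[(method, path)]
    except KeyError:
        raise ValueError(f"unsupported route: {method} {path}")
```

In practice this mapping would live in an nginx or similar proxy config baked into the container, not in application code.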

| Framework | Version | Base Image |
| --- | --- | --- |
| vLLM | N/A | `vllm/vllm-openai:v0.10.1` |
| SGLang | 0.5.4.post1 | `lmsysorg/sglang:v0.5.4.post1` |
| TensorRT-LLM | N/A | `nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc8` |
| LMI | N/A | `763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126` |
| DJL | N/A | `deepjavalibrary/djl-serving:0.32.0-pytorch-cu126` |
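For images that expose an OpenAI-compatible API (such as the vLLM image above), a client posts an OpenAI chat-completions body to the endpoint's `/invocations` route. A minimal sketch, with placeholder model and prompt values:

```python
import json

# Hypothetical sketch: the request body a client might POST to /invocations
# when the serving container speaks the OpenAI chat completions format.
# The model name and prompt below are placeholders, not real defaults.
def build_invocation_body(model: str, prompt: str, max_tokens: int = 256) -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)
```

This string would typically be passed as the `Body` of a `sagemaker-runtime` `InvokeEndpoint` call with `ContentType: application/json`.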

vLLM

Under Construction

This section of the documentation is under construction.

SGLang

Under Construction

This section of the documentation is under construction.

TensorRT-LLM

Under Construction

This section of the documentation is under construction.

LMI

Under Construction

This section of the documentation is under construction.

DJL

Under Construction

This section of the documentation is under construction.