LLM Serving
Foundation models are commonly deployed to Amazon SageMaker AI managed inference inside a serving container. Most serving containers handle HTTP requests natively, though some require a reverse proxy to satisfy the HTTP serving contract of a real-time endpoint on Amazon SageMaker AI managed inference. To date, the following LLM servers are supported:
| Framework | Version | Base Image |
|---|---|---|
| vLLM | N/A | vllm/vllm-openai:v0.10.1 |
| SGLang | 0.5.4.post1 | lmsysorg/sglang:v0.5.4.post1 |
| TensorRT-LLM | N/A | nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc8 |
| LMI | N/A | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126 |
| DJL | N/A | deepjavalibrary/djl-serving:0.32.0-pytorch-cu126 |
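To illustrate the reverse-proxy requirement mentioned above: SageMaker real-time endpoints expect the container to answer `GET /ping` and `POST /invocations` on port 8080, while an OpenAI-compatible server such as vLLM exposes routes like `/health` and `/v1/chat/completions`. The sketch below is a minimal, stdlib-only shim that maps one contract onto the other; the backend address and route mapping are illustrative assumptions, not the configuration these containers actually ship with.

```python
import http.server
import urllib.request

# Assumed backend address for illustration: an OpenAI-compatible
# server (e.g. vLLM) listening locally on its default port.
BACKEND = "http://127.0.0.1:8000"

def map_route(method: str, path: str):
    """Translate SageMaker's serving contract onto backend routes.

    Returns the backend URL to forward to, or None for unknown routes.
    """
    if method == "GET" and path == "/ping":
        return f"{BACKEND}/health"                  # health check
    if method == "POST" and path == "/invocations":
        return f"{BACKEND}/v1/chat/completions"     # inference
    return None

class Proxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self._forward("GET")

    def do_POST(self):
        self._forward("POST")

    def _forward(self, method):
        target = map_route(method, self.path)
        if target is None:
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(
            target, data=body, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type",
                                              "application/json"))
            self.end_headers()
            self.wfile.write(resp.read())

# To run the shim on SageMaker's required port:
# http.server.HTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

In practice this role is usually played by nginx or a similar proxy baked into the serving image; the point is only that the two HTTP contracts differ and something must bridge them.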
vLLM
Under Construction
This section of the documentation is under construction.
SGLang
Under Construction
This section of the documentation is under construction.
TensorRT-LLM
Under Construction
This section of the documentation is under construction.
LMI
Under Construction
This section of the documentation is under construction.
DJL
Under Construction
This section of the documentation is under construction.