LLM Serving
Foundation models are commonly deployed to Amazon SageMaker AI managed inference inside a serving container. Most serving containers handle HTTP requests natively, though some require a reverse proxy to satisfy the HTTP serving contract of a real-time endpoint on Amazon SageMaker AI managed inference. To date, the following LLM servers are supported:
| Framework | Version | Base Image |
|---|---|---|
| vLLM | N/A | vllm/vllm-openai:v0.10.1 |
| SGLang | 0.5.4.post1 | lmsysorg/sglang:v0.5.4.post1 |
| TensorRT-LLM | N/A | nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc8 |
| LMI | N/A | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126 |
| DJL | N/A | deepjavalibrary/djl-serving:0.32.0-pytorch-cu126 |
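To illustrate the reverse-proxy requirement mentioned above: SageMaker real-time endpoints expect the container to answer `GET /ping` and `POST /invocations` on port 8080, while an OpenAI-compatible server such as vLLM exposes routes like `/health` and `/v1/chat/completions`. The sketch below is a minimal, stdlib-only shim that maps one contract onto the other; the backend address and route mapping are illustrative assumptions, not the configuration these containers actually ship with.

```python
import http.server
import urllib.request

# Assumed backend address for illustration: an OpenAI-compatible
# server (e.g. vLLM) listening locally on its default port.
BACKEND = "http://127.0.0.1:8000"

def map_route(method: str, path: str):
    """Translate SageMaker's serving contract onto backend routes.

    Returns the backend URL to forward to, or None for unknown routes.
    """
    if method == "GET" and path == "/ping":
        return f"{BACKEND}/health"                  # health check
    if method == "POST" and path == "/invocations":
        return f"{BACKEND}/v1/chat/completions"     # inference
    return None

class Proxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self._forward("GET")

    def do_POST(self):
        self._forward("POST")

    def _forward(self, method):
        target = map_route(method, self.path)
        if target is None:
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else None
        req = urllib.request.Request(
            target, data=body, method=method,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type",
                                              "application/json"))
            self.end_headers()
            self.wfile.write(resp.read())

# To run the shim on SageMaker's required port:
# http.server.HTTPServer(("0.0.0.0", 8080), Proxy).serve_forever()
```

In practice this role is usually played by nginx or a similar proxy baked into the serving image; the point is only that the two HTTP contracts differ and something must bridge them.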
vLLM
Under Construction
This section of the documentation is under construction.
SGLang
Under Construction
This section of the documentation is under construction.
TensorRT-LLM
Under Construction
This section of the documentation is under construction.
LMI
Under Construction
This section of the documentation is under construction.
DJL
Under Construction
This section of the documentation is under construction.