Examples¶
Concise walkthroughs for each deployment configuration. Every example follows the same pattern: generate, build, deploy, test. For a full end-to-end tutorial, see Getting Started.
HTTP: sklearn + Flask¶
Deploy a scikit-learn model with Flask serving on a CPU instance.
```bash
ml-container-creator sklearn-flask-demo \
  --deployment-config=http-flask \
  --engine=sklearn \
  --model-format=pkl \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Copy your own model into the project (or use the generated sample):
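A minimal sketch, assuming the project is generated into a directory named after it and expects the model under a model/ subdirectory (check the generated project for the exact path):

```bash
# Hypothetical source path and destination; adjust to the layout the
# generated project documents.
cp /path/to/model.pkl sklearn-flask-demo/model/model.pkl
```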
Build, push, deploy:
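With --build-target=codebuild (as in the command above), the generated ./do scripts cover the remaining steps; the directory name is assumed to match the project name:

```bash
cd sklearn-flask-demo
./do/submit   # build the image with CodeBuild and push it to ECR
./do/deploy   # deploy the pushed image to the endpoint
./do/test     # send a sample request to validate the endpoint
```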
HTTP: XGBoost + FastAPI¶
Deploy an XGBoost model with FastAPI serving.
```bash
ml-container-creator xgboost-fastapi-demo \
  --deployment-config=http-fastapi \
  --engine=xgboost \
  --model-format=json \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
HTTP: TensorFlow + Flask¶
Deploy a TensorFlow SavedModel with Flask serving.
```bash
ml-container-creator tf-flask-demo \
  --deployment-config=http-flask \
  --engine=tensorflow \
  --model-format=SavedModel \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Transformers: vLLM¶
Deploy an LLM with vLLM. GPU instance required.
```bash
ml-container-creator vllm-demo \
  --deployment-config=transformers-vllm \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
LLM containers are large. Use CodeBuild for the image build:
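The same ./do scripts apply here (the project directory name is assumed to match the project name):

```bash
cd vllm-demo
./do/submit   # build and push the image with CodeBuild
./do/deploy   # deploy the built image to the endpoint
./do/test     # validate the endpoint
```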
For gated models (e.g., Llama), add --hf-token='$HF_TOKEN' and export the token in your environment. See HuggingFace Authentication.
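For example, using a gated Llama model (the project name and token value are placeholders; pass the flag exactly as shown, with the variable exported in your environment):

```bash
export HF_TOKEN=<your-huggingface-token>   # placeholder; never commit real tokens
ml-container-creator llama-demo \
  --deployment-config=transformers-vllm \
  --model-name=meta-llama/Llama-3.2-3B-Instruct \
  --hf-token='$HF_TOKEN' \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```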
Transformers: SGLang¶
Deploy an LLM with SGLang. Same workflow as vLLM with a different deployment config.
```bash
ml-container-creator sglang-demo \
  --deployment-config=transformers-sglang \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Transformers: TensorRT-LLM¶
Deploy an LLM with NVIDIA TensorRT-LLM. This configuration requires NGC authentication for the base image and A10G or newer GPUs; use ml.g5 instances rather than ml.g6.
```bash
ml-container-creator trtllm-demo \
  --deployment-config=transformers-tensorrt-llm \
  --model-name=meta-llama/Llama-3.2-3B-Instruct \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Set your NGC API key before building:
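A sketch, assuming the build reads the key from an NGC_API_KEY environment variable; check the generated project for the exact variable name it expects:

```bash
export NGC_API_KEY=<your-ngc-api-key>   # placeholder; generate a key at ngc.nvidia.com
```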
The generated container runs TensorRT-LLM on port 8081 behind an Nginx reverse proxy on port 8080 for SageMaker compatibility. Key environment variables for tuning:
| Variable | Default | Description |
|---|---|---|
| `TRTLLM_TP_SIZE` | 1 | Tensor parallelism (set to GPU count) |
| `TRTLLM_MAX_BATCH_SIZE` | 256 | Maximum batch size |
| `TRTLLM_MAX_INPUT_LEN` | 2048 | Maximum input token length |
| `TRTLLM_MAX_OUTPUT_LEN` | 512 | Maximum output token length |
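For example, to use all four GPUs of an ml.g5.12xlarge for tensor parallelism (shown here as shell exports for illustration; how the variables reach the endpoint depends on the generated deploy configuration):

```bash
export TRTLLM_TP_SIZE=4            # match the instance's GPU count
export TRTLLM_MAX_INPUT_LEN=4096   # allow longer prompts
./do/deploy
```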
Transformers: LMI (Large Model Inference)¶
Deploy an LLM with AWS Large Model Inference (DJL-based). Uses serving.properties for configuration instead of environment variables.
```bash
ml-container-creator lmi-demo \
  --deployment-config=transformers-lmi \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Transformers: DJL¶
Deploy an LLM with Deep Java Library serving.
```bash
ml-container-creator djl-demo \
  --deployment-config=transformers-djl \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Triton: FIL (Tree Models)¶
Deploy XGBoost or LightGBM models on NVIDIA Triton Inference Server using the Forest Inference Library backend.
```bash
ml-container-creator triton-fil-demo \
  --deployment-config=triton-fil \
  --model-format=json \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
The generator creates a Triton model repository layout with config.pbtxt. Place your model file in the generated model_repository/ directory before building.
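A minimal sketch of that copy step, assuming the conventional Triton layout of model_repository/&lt;model-name&gt;/1/ (the actual model directory name comes from the generated config.pbtxt):

```bash
# Place the JSON tree model into version directory "1" of the generated layout.
cp /path/to/model.json triton-fil-demo/model_repository/<model-name>/1/
```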
Triton: ONNX Runtime¶
Deploy ONNX models on Triton.
```bash
ml-container-creator triton-onnx-demo \
  --deployment-config=triton-onnxruntime \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Triton: Python Backend¶
Deploy custom Python models on Triton. The Python backend gives full control over preprocessing, inference, and postprocessing logic.
```bash
ml-container-creator triton-python-demo \
  --deployment-config=triton-python \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Edit the generated model.py in the model repository to implement your inference logic, then build and deploy.
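A typical iteration loop, assuming the conventional model_repository/&lt;model-name&gt;/1/model.py location used by Triton's Python backend:

```bash
# Edit the backend entry point, then rebuild, redeploy, and retest.
$EDITOR triton-python-demo/model_repository/<model-name>/1/model.py
./do/submit && ./do/deploy && ./do/test
```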
CodeBuild CI/CD¶
Any of the examples above can use CodeBuild for the image build instead of building locally. Set --build-target=codebuild during generation, then use ./do/submit instead of ./do/build + ./do/push:
```bash
./do/submit   # Creates CodeBuild project, uploads source, builds, pushes to ECR
./do/deploy   # Deploy the CodeBuild-built image
./do/test     # Validate the endpoint
```
./do/submit automatically creates the CodeBuild project, IAM service role, and S3 source bucket on first run. All projects share a single ECR repository (ml-container-creator) with project-specific image tags.
HyperPod EKS Deployment¶
Any of the examples above can target HyperPod EKS instead of managed inference. Set --deployment-target=hyperpod-eks and provide your cluster details:
```bash
ml-container-creator hyperpod-demo \
  --deployment-config=transformers-vllm \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=hyperpod-eks \
  --hyperpod-cluster=my-cluster \
  --hyperpod-namespace=ml-serving \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts
```
Cleanup¶
Tear down resources when done to stop incurring charges:
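If the generated project does not provide a teardown script, the endpoint resources can be removed with the AWS CLI. This sketch assumes the managed-inference target maps to SageMaker endpoints, as the SageMaker-compatible container setup suggests; all names are placeholders:

```bash
# Remove the endpoint and its supporting resources.
aws sagemaker delete-endpoint --endpoint-name <endpoint-name> --region us-east-1
aws sagemaker delete-endpoint-config --endpoint-config-name <endpoint-config-name> --region us-east-1
aws sagemaker delete-model --model-name <model-name> --region us-east-1

# Optionally delete the project's image tag from the shared ECR repository.
aws ecr batch-delete-image --repository-name ml-container-creator \
  --image-ids imageTag=<project-tag> --region us-east-1
```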