Examples

Concise walkthroughs for each deployment configuration. Every example follows the same pattern: generate, build, deploy, test. For a full end-to-end tutorial, see Getting Started.

HTTP: sklearn + Flask

Deploy a scikit-learn model with Flask serving on a CPU instance.

ml-container-creator sklearn-flask-demo \
  --deployment-config=http-flask \
  --engine=sklearn \
  --model-format=pkl \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

Copy your own model into the project (or use the generated sample):

cp /path/to/model.pkl code/model.pkl
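It can be worth confirming that the model file is in place before building. The sketch below assumes the model lives at code/model.pkl, as in the copy command above. The check_model_file helper is hypothetical and is not generated by ml-container-creator.

```shell
#!/usr/bin/env bash
# Hypothetical pre-build guard (not part of the generated project).
# Succeeds only if the given path exists and is a non-empty file.
check_model_file() {
  local path="$1"
  if [ ! -s "$path" ]; then
    echo "model file missing or empty: $path" >&2
    return 1
  fi
  echo "ok: $path"
}

# Usage, from the project root:
#   check_model_file code/model.pkl && ./do/build
```

Chaining the guard with && stops the build before it starts when the file is absent.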

Build, push, deploy, and test:

./do/build
./do/push
./do/deploy
./do/test

HTTP: XGBoost + FastAPI

Deploy an XGBoost model with FastAPI serving.

ml-container-creator xgboost-fastapi-demo \
  --deployment-config=http-fastapi \
  --engine=xgboost \
  --model-format=json \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/build && ./do/push && ./do/deploy && ./do/test

HTTP: TensorFlow + Flask

Deploy a TensorFlow SavedModel with Flask serving.

ml-container-creator tf-flask-demo \
  --deployment-config=http-flask \
  --engine=tensorflow \
  --model-format=SavedModel \
  --include-sample-model \
  --deployment-target=managed-inference \
  --instance-type=ml.m5.large \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/build && ./do/push && ./do/deploy && ./do/test

Transformers: vLLM

Deploy an LLM with vLLM. GPU instance required.

ml-container-creator vllm-demo \
  --deployment-config=transformers-vllm \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

LLM container images are large, so build and push them with CodeBuild rather than locally:

./do/submit    # Build and push via CodeBuild
./do/deploy
./do/test

For gated models (e.g., Llama), add --hf-token='$HF_TOKEN' to the generate command and export HF_TOKEN in your environment. See HuggingFace Authentication.
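A missing token usually shows up only once a model download fails, so checking for it up front can save a build cycle. The following is a minimal sketch: require_env is a hypothetical helper, not part of ml-container-creator, and it only checks that the variable is exported and non-empty, not that the token is valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ml-container-creator).
# Succeeds only if the named variable is exported and non-empty.
require_env() {
  local name="$1"
  # printenv only sees exported variables, which is what child processes inherit.
  if [ -z "$(printenv "$name")" ]; then
    echo "missing required environment variable: $name" >&2
    return 1
  fi
}

# Usage:
#   export HF_TOKEN='your-hf-token'
#   require_env HF_TOKEN && echo "HF_TOKEN is set"
```

The same check works for any other secret a build depends on.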

Transformers: SGLang

Deploy an LLM with SGLang. Same workflow as vLLM with a different deployment config.

ml-container-creator sglang-demo \
  --deployment-config=transformers-sglang \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g6.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/submit && ./do/deploy && ./do/test

Transformers: TensorRT-LLM

Deploy an LLM with NVIDIA TensorRT-LLM. This configuration requires NGC authentication to pull the base image, and A10G or newer GPUs (use ml.g5 instances, not ml.g6).

ml-container-creator trtllm-demo \
  --deployment-config=transformers-tensorrt-llm \
  --model-name=meta-llama/Llama-3.2-3B-Instruct \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

Set your NGC API key before building:

export NGC_API_KEY='your-ngc-api-key'
./do/submit && ./do/deploy && ./do/test

The generated container runs TensorRT-LLM on port 8081 behind an Nginx reverse proxy on port 8080 for SageMaker compatibility. Key environment variables for tuning:

Variable	Default	Description
`TRTLLM_TP_SIZE`	`1`	Tensor parallelism (set to GPU count)
`TRTLLM_MAX_BATCH_SIZE`	`256`	Maximum batch size
`TRTLLM_MAX_INPUT_LEN`	`2048`	Maximum input token length
`TRTLLM_MAX_OUTPUT_LEN`	`512`	Maximum output token length
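Because the table recommends setting TRTLLM_TP_SIZE to the GPU count, a small lookup can keep the two in sync. This sketch covers only the ml.g5 family, and g5_gpu_count is a hypothetical helper. How the variable reaches the container (Dockerfile, deploy script, or endpoint configuration) depends on the generated project and is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: GPU count per ml.g5 instance size.
g5_gpu_count() {
  case "$1" in
    ml.g5.xlarge|ml.g5.2xlarge|ml.g5.4xlarge|ml.g5.8xlarge|ml.g5.16xlarge)
      echo 1 ;;
    ml.g5.12xlarge|ml.g5.24xlarge)
      echo 4 ;;
    ml.g5.48xlarge)
      echo 8 ;;
    *)
      echo "unknown instance type: $1" >&2
      return 1 ;;
  esac
}

# Match tensor parallelism to the instance used in the example above.
TRTLLM_TP_SIZE="$(g5_gpu_count ml.g5.12xlarge)"
export TRTLLM_TP_SIZE
echo "TRTLLM_TP_SIZE=${TRTLLM_TP_SIZE}"   # prints TRTLLM_TP_SIZE=4
```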

Transformers: LMI (Large Model Inference)

Deploy an LLM with AWS Large Model Inference (DJL-based). LMI is configured through a serving.properties file rather than environment variables.

ml-container-creator lmi-demo \
  --deployment-config=transformers-lmi \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/submit && ./do/deploy && ./do/test
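For orientation, the sketch below prints what a small serving.properties might contain. The option keys are standard LMI settings, but the file the generator actually writes may use different keys and values, so treat this as an illustration only. render_serving_properties is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints an illustrative serving.properties to stdout.
# The generated project's real file may differ.
render_serving_properties() {
  local model_id="$1" tp_degree="$2"
  cat <<EOF
option.model_id=${model_id}
option.tensor_parallel_degree=${tp_degree}
option.rolling_batch=auto
EOF
}

# ml.g5.12xlarge has 4 GPUs, so a tensor parallel degree of 4 uses all of them.
render_serving_properties openai/gpt-oss-20b 4
```

The output goes to stdout so it can be compared with the generated file instead of overwriting it.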

Transformers: DJL

Deploy an LLM with Deep Java Library serving.

ml-container-creator djl-demo \
  --deployment-config=transformers-djl \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/submit && ./do/deploy && ./do/test

Triton: FIL (Tree Models)

Deploy XGBoost or LightGBM models on NVIDIA Triton Inference Server using the Forest Inference Library backend.

ml-container-creator triton-fil-demo \
  --deployment-config=triton-fil \
  --model-format=json \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

The generator creates a Triton model repository layout with config.pbtxt. Place your model file in the generated model_repository/ directory before building.
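Triton expects each model directory to hold a config.pbtxt and at least one numeric version directory containing the model file. The following sketch checks for that conventional layout before building. check_triton_repo is a hypothetical helper, and the directory and file names the generator produces may differ from the ones in the usage comment.

```shell
#!/usr/bin/env bash
# Hypothetical pre-build check (not part of the generated project).
# Verifies the conventional Triton layout:
#   <repo>/<model>/config.pbtxt
#   <repo>/<model>/<numeric version>/<model file>
check_triton_repo() {
  local repo="$1" model version found=0 has_version
  [ -d "$repo" ] || { echo "not a directory: $repo" >&2; return 1; }
  for model in "$repo"/*/; do
    [ -d "$model" ] || continue
    found=1
    [ -f "${model}config.pbtxt" ] || { echo "missing ${model}config.pbtxt" >&2; return 1; }
    has_version=0
    for version in "$model"[0-9]*/; do
      [ -d "$version" ] || continue
      # A version directory counts only if it contains at least one entry.
      if [ -n "$(ls -A "$version")" ]; then has_version=1; fi
    done
    [ "$has_version" -eq 1 ] || { echo "no populated version directory in $model" >&2; return 1; }
  done
  [ "$found" -eq 1 ] || { echo "no models found in $repo" >&2; return 1; }
  echo "ok: $repo"
}

# Usage, from the project root:
#   check_triton_repo model_repository && ./do/build
```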

./do/build && ./do/push && ./do/deploy && ./do/test

Triton: ONNX Runtime

Deploy ONNX models on Triton using the ONNX Runtime backend.

ml-container-creator triton-onnx-demo \
  --deployment-config=triton-onnxruntime \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/build && ./do/push && ./do/deploy && ./do/test

Triton: Python Backend

Deploy custom Python models on Triton. The Python backend gives full control over preprocessing, inference, and postprocessing logic.

ml-container-creator triton-python-demo \
  --deployment-config=triton-python \
  --deployment-target=managed-inference \
  --instance-type=ml.g5.xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

Edit the generated model.py in the model repository to implement your inference logic, then build, push, deploy, and test with the same ./do/ scripts as the other Triton examples.

CodeBuild CI/CD

Any of the examples above can use CodeBuild for the image build instead of building locally. Set --build-target=codebuild during generation, then run ./do/submit in place of ./do/build and ./do/push:

./do/submit    # Creates CodeBuild project, uploads source, builds, pushes to ECR
./do/deploy    # Deploy the CodeBuild-built image
./do/test      # Validate the endpoint

./do/submit automatically creates the CodeBuild project, IAM service role, and S3 source bucket on first run. All projects share a single ECR repository (ml-container-creator) with project-specific image tags.
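Since every project pushes to the same repository, images are told apart only by their tags. The sketch below composes a full image URI from the standard ECR registry format. ecr_image_uri is a hypothetical helper, the account ID is a placeholder, and using the project name as the tag is an assumption; the tags ./do/submit actually writes may follow a different scheme.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds an image URI in the shared ml-container-creator repository.
# Assumption: the tag passed in matches whatever ./do/submit actually pushed.
ecr_image_uri() {
  local account_id="$1" region="$2" tag="$3"
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com/ml-container-creator:${tag}"
}

ecr_image_uri 123456789012 us-east-1 sklearn-flask-demo
# prints 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-container-creator:sklearn-flask-demo

# To see the tags that actually exist, list the repository with the AWS CLI:
#   aws ecr list-images --repository-name ml-container-creator --region us-east-1
```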

HyperPod EKS Deployment

Any of the examples above can target HyperPod EKS instead of managed inference. Set --deployment-target=hyperpod-eks and provide your cluster details:

ml-container-creator hyperpod-demo \
  --deployment-config=transformers-vllm \
  --model-name=openai/gpt-oss-20b \
  --deployment-target=hyperpod-eks \
  --hyperpod-cluster=my-cluster \
  --hyperpod-namespace=ml-serving \
  --instance-type=ml.g5.12xlarge \
  --build-target=codebuild \
  --region=us-east-1 \
  --skip-prompts

./do/submit && ./do/deploy && ./do/test hyperpod

Cleanup

Tear down resources when done to stop incurring charges:

./do/clean endpoint   # Delete SageMaker endpoint, config, and inference component
./do/clean ecr        # Delete ECR images
./do/clean codebuild  # Delete CodeBuild project and IAM role
./do/clean all        # All of the above