How It Works¶
This guide describes how ML Container Creator (MCC) works: what decisions it captures, how it generates deployment assets, and how those assets get built and deployed. For a hands-on walkthrough, see the Getting Started Guide.
The Three Decisions¶
Deploying a model to SageMaker requires three interrelated decisions:
- Model selection -- Where does the model come from and in what format?
- Model serving -- Which framework handles inference requests?
- Model deployment -- What instance type and deployment target?
Each decision influences the generated technical assets (Dockerfiles, serving code, deployment scripts). MCC captures all three through its prompt flow or CLI flags and produces a complete, buildable project.
Generator Flow¶
MCC collects configuration, validates it, and generates a project directory with all the files needed to build, push, and deploy a container.
```mermaid
flowchart LR
    A[Collect config] --> B[Validate]
    B --> C[Generate project]
    C --> D[do/build]
    D --> E[do/push]
    E --> F[do/deploy]
```
Configuration can come from interactive prompts, CLI flags, environment variables, config files, or MCP servers. These sources are merged in a strict precedence order -- see the Configuration Guide for the full precedence chain.
In interactive mode, the generator walks through five phases: model selection (deployment config + model name), serving configuration (target, profile, base image), infrastructure (region, instance type), details (framework version, modules), and project settings (name, destination). The instance type is a derived value -- once the model and base image are known, the instance-sizer computes VRAM requirements and recommends compatible instances automatically. In non-interactive mode (--skip-prompts), all values come from CLI flags or config files.
The Getting Started Guide has complete walkthroughs for both a predictive model (sklearn + Flask) and an LLM (SGLang), including the exact CLI commands and generated project structures.
Models¶
MCC handles two categories of models with fundamentally different container architectures.
| Aspect | Predictive ML | Generative AI |
|---|---|---|
| Examples | sklearn classifiers, XGBoost regressors, TensorFlow CNNs | Llama, Mistral, GPT |
| Model size | KB -- MB | GB -- hundreds of GB |
| Storage | Local files bundled in container at build time | Downloaded from HuggingFace Hub at runtime |
| Instance types | CPU-optimized (ml.m5, ml.m6g) | GPU-required (ml.g5, ml.g6) |
| Inference latency | Milliseconds | Seconds |
Predictive Models¶
Predictive models are small models for classification, regression, and similar tasks. MCC requires a model format selection so the model loader knows which file to look for at /opt/ml/model/.
Supported frameworks¶
| Framework | Model Formats | Generated Version | Use Case |
|---|---|---|---|
| scikit-learn | pkl, joblib | scikit-learn==1.7.1, joblib==1.4.2 | Classification, Regression, Clustering, Dimensionality Reduction, Data Preprocessing, Model Selection |
| XGBoost | json, model, ubj | xgboost==2.1.3 | Prediction, Classification |
| TensorFlow | keras, h5, SavedModel | tensorflow==2.20.0, setuptools>=65.0.0 | Prediction, Classification, Deep learning, Neural Networks, NLP |
Loading models into the container¶
Local copy is the default strategy. A model file from your local filesystem is copied into the container at build time using a Dockerfile COPY directive. When bringing your own model, uncomment and edit the corresponding COPY line in the generated Dockerfile.
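A representative sketch of that line is shown below; the model filename is hypothetical, and the generated Dockerfile contains the actual commented-out directive to edit:

```dockerfile
# Hypothetical example -- replace my_model.pkl with your actual model file.
COPY my_model.pkl /opt/ml/model/
```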
The target directory must be /opt/ml/model/ for SageMaker compatibility.
Sample model. For testing without a real model, MCC can train a sample model on the Abalone dataset using the selected framework. The sample model is automatically copied into the container. This is for validating the build and deployment pipeline, not for production use.
Generative Models¶
Generative models (LLMs) are specified by HuggingFace model ID at generation time. MCC does not require a model format -- the serving framework handles downloading and loading the model into GPU memory.
```json
{
  "framework": "transformers",
  "modelName": "mistralai/Mistral-7B-Instruct-v0.2",
  "modelServer": "sglang"
}
```
HuggingFace authentication¶
Some models are gated and require a HuggingFace API token. You can provide it via:
- CLI flag: --hf-token="hf_..." or --hf-token='$HF_TOKEN'
- Environment variable: export HF_TOKEN="hf_..." then reference $HF_TOKEN
- Interactive prompt (when selecting a custom model)
- Config file: "hfToken": "$HF_TOKEN"
See the Configuration Guide for details and security best practices.
HuggingFace API lookups¶
When a model ID is specified, MCC validates the model by querying HuggingFace endpoints:
| Endpoint | Purpose |
|---|---|
| GET /api/models/{modelId} | Validate model exists, get metadata |
| GET /{modelId}/resolve/main/tokenizer_config.json | Extract chat template |
| GET /{modelId}/resolve/main/config.json | Get model architecture details |
These calls time out after 5 seconds and handle 404/429 errors gracefully. Use --offline to skip them entirely.
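These are standard HuggingFace Hub URLs, so the same lookups can be reproduced manually with curl. The model ID below is the one from the earlier example:

```bash
MODEL_ID="mistralai/Mistral-7B-Instruct-v0.2"

# Validate that the model exists and fetch its metadata
# (add -H "Authorization: Bearer $HF_TOKEN" for gated models)
curl -s --max-time 5 "https://huggingface.co/api/models/${MODEL_ID}"

# Files MCC reads for the chat template and architecture details
curl -s --max-time 5 "https://huggingface.co/${MODEL_ID}/resolve/main/tokenizer_config.json"
curl -s --max-time 5 "https://huggingface.co/${MODEL_ID}/resolve/main/config.json"
```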
Serving Frameworks¶
MCC generates different container architectures depending on the serving framework.
HTTP servers (predictive models)¶
Predictive models need an HTTP layer to expose the SageMaker-required /ping and /invocations endpoints on port 8080. MCC generates a model handler (model_handler.py) for inference logic and pairs it with a web server and Nginx reverse proxy.
| Web Server | Description |
|---|---|
| Flask | Lightweight Python web framework, served via Gunicorn |
| FastAPI | Modern async framework with automatic OpenAPI docs |
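Whichever web server is selected, the container exposes the same two SageMaker endpoints, so a quick smoke test against a locally running container looks like the following sketch (the JSON payload is hypothetical; the expected input format depends on your model_handler.py):

```bash
# Health check -- SageMaker calls this to verify the container is up
curl -s http://localhost:8080/ping

# Inference request -- payload shape depends on the generated model handler
curl -s -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"instances": [[5.1, 3.5, 1.4, 0.2]]}'
```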
LLM servers (generative models)¶
LLM serving frameworks handle both model loading and HTTP serving. Some require an Nginx reverse proxy for SageMaker compatibility.
| Framework | Version | Base Image |
|---|---|---|
| vLLM | N/A | vllm/vllm-openai:v0.10.1 |
| SGLang | 0.5.4.post1 | lmsysorg/sglang:v0.5.4.post1 |
| TensorRT-LLM | N/A | nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc8 |
| LMI | N/A | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126 |
| DJL | N/A | deepjavalibrary/djl-serving:0.32.0-pytorch-cu126 |
Triton Inference Server¶
For multi-framework, high-throughput serving, MCC generates Triton-compatible projects with model repository layouts and config.pbtxt files. Supported backends: FIL, ONNX Runtime, TensorFlow, PyTorch, vLLM, TensorRT-LLM, and Python.
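As a rough illustration of that layout (directory and file names here are hypothetical; the backend-specific files depend on your selections), a Triton model repository follows this structure:

```
model_repository/
└── my_model/               # one subdirectory per model
    ├── config.pbtxt        # backend, input/output tensors, batching settings
    └── 1/                  # numeric version directory
        └── model.onnx      # model artifact for the chosen backend (ONNX here)
```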
Container Building¶
MCC generates Docker containers that package the base image, application code, configuration files, and dependencies. For predictive models, model artifacts are included in the image. For generative models, the container downloads models from HuggingFace Hub at runtime.
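The generated Dockerfile varies by framework and serving profile, but a heavily simplified sketch of its shape (the base image, paths, and entrypoint below are placeholders, not the generated values) looks like this:

```dockerfile
# Illustrative only -- the generated Dockerfile differs per framework and profile.
FROM python:3.11-slim                     # base image chosen at generation time

# Pinned framework dependencies (see the supported-frameworks table above)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Serving code: model handler, web server, proxy configuration
COPY app/ /opt/program/

# Predictive models only -- LLM containers download models at runtime instead
COPY my_model.pkl /opt/ml/model/

EXPOSE 8080
ENTRYPOINT ["python", "/opt/program/serve.py"]
```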
Local builds¶
Run ./do/build to create the image locally and ./do/push to upload it to Amazon ECR. You can test locally with ./do/run before pushing.
Locally built containers may fail with exec format errors if deployed onto a different architecture (e.g., built on ARM, deployed on x86). Use Docker's --platform flag, or use CodeBuild for production builds.
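If you build locally on a machine whose CPU architecture differs from the deployment instance, pinning the platform avoids the mismatch. A sketch assuming a plain docker build (the image tag is hypothetical, and the generated do/build script may handle this differently):

```bash
# Build an x86_64 image even on an ARM host (e.g., Apple Silicon)
docker build --platform linux/amd64 -t my-mcc-image .
```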
AWS CodeBuild¶
For CI/CD workflows, ./do/submit creates a CodeBuild project that builds the image and pushes it to ECR in a single step. MCC generates the IAM policy document and buildspec.yml automatically. This is the preferred method for production containers, especially for large LLM images where fast network access to base image registries matters.
Endpoint Deployment¶
Once a container is built and pushed to ECR, ./do/deploy provisions the deployment target. MCC supports two targets:
- Managed Inference (managed-inference): SageMaker real-time endpoints via the Inference Components API. This is the default.
- HyperPod EKS (hyperpod-eks): Kubernetes deployment on existing SageMaker HyperPod clusters.
After deployment, ./do/test validates the endpoint, ./do/logs tails logs, and ./do/clean tears down resources.
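Put together, a typical lifecycle from the generated project directory looks like this (all scripts are generated by MCC; run them in order and clean up when finished):

```bash
./do/build    # build the container image locally
./do/push     # push the image to Amazon ECR
./do/deploy   # provision the endpoint on the chosen deployment target
./do/test     # send a validation request to the endpoint
./do/logs     # tail endpoint logs
./do/clean    # tear down the deployed resources
```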
See Deployment & Inference for the full lifecycle script reference and target-specific details.
Architecture Overview¶
The following diagram shows the end-to-end flow from model source through deployment. Items marked with * are on the roadmap.
```mermaid
flowchart TB
    subgraph sources["Model Sources"]
        local["Local Models<br/>.pkl, .joblib, .h5"]
        s3["S3 Buckets*"]
        hf["HuggingFace Hub<br/>LLMs, chat models"]
    end
    subgraph predict["Predictive ML"]
        pfw["Framework<br/>scikit-learn, XGBoost, TensorFlow"]
        handler["model_handler.py"]
        http["HTTP Server<br/>Flask / FastAPI + Nginx"]
    end
    subgraph genai["Generative AI"]
        gfw["Serving Framework<br/>vLLM, SGLang, TensorRT-LLM, LMI, DJL"]
    end
    subgraph container["Container Build"]
        dockerfile["Dockerfile Generation"]
        base["Base Image"]
        deps["Dependencies"]
        code["Application Code"]
        build["Docker Build<br/>local or CodeBuild"]
        ecr["Push to ECR"]
    end
    subgraph deploy["Deployment"]
        endpoint["SageMaker Endpoint<br/>port 8080<br/>GET /ping<br/>POST /invocations"]
        clients["Client Applications"]
    end
    sources --> predict & genai
    pfw --> handler --> http
    http --> dockerfile
    gfw --> dockerfile
    dockerfile --> base & deps & code
    base & deps & code --> build --> ecr --> endpoint --> clients
```