How It Works

This guide describes how ML Container Creator (MCC) works: what decisions it captures, how it generates deployment assets, and how those assets get built and deployed. For a hands-on walkthrough, see the Getting Started Guide.

The Three Decisions

Deploying a model to SageMaker requires three interrelated decisions:

  1. Model selection -- Where does the model come from and in what format?
  2. Model serving -- Which framework handles inference requests?
  3. Model deployment -- What instance type and deployment target?

Each decision influences the generated technical assets (Dockerfiles, serving code, deployment scripts). MCC captures all three through its prompt flow or CLI flags and produces a complete, buildable project.

Generator Flow

MCC collects configuration, validates it, and generates a project directory with all the files needed to build, push, and deploy a container.

flowchart LR
    A[Collect config] --> B[Validate]
    B --> C[Generate project]
    C --> D[do/build]
    D --> E[do/push]
    E --> F[do/deploy]

Configuration can come from interactive prompts, CLI flags, environment variables, config files, or MCP servers. These sources are merged in a strict precedence order -- see the Configuration Guide for the full precedence chain.

In interactive mode, the generator walks through five phases: model selection (deployment config + model name), serving configuration (target, profile, base image), infrastructure (region, instance type), details (framework version, modules), and project settings (name, destination). The instance type is a derived value -- once the model and base image are known, the instance-sizer computes VRAM requirements and recommends compatible instances automatically. In non-interactive mode (--skip-prompts), all values come from CLI flags or config files.
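
For example, a config file can pin the values the prompts would otherwise collect. A minimal sketch using only keys that appear elsewhere in this guide (the full schema is in the Configuration Guide):

{
  "framework": "transformers",
  "modelName": "mistralai/Mistral-7B-Instruct-v0.2",
  "modelServer": "sglang",
  "hfToken": "$HF_TOKEN"
}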

The Getting Started Guide has complete walkthroughs for both a predictive model (sklearn + Flask) and an LLM (SGLang), including the exact CLI commands and generated project structures.

Models

MCC handles two categories of models with fundamentally different container architectures.

| Aspect | Predictive ML | Generative AI |
| --- | --- | --- |
| Examples | sklearn classifiers, XGBoost regressors, TensorFlow CNNs | Llama, Mistral, GPT |
| Model size | KB -- MB | GB -- hundreds of GB |
| Storage | Local files bundled in container at build time | Downloaded from HuggingFace Hub at runtime |
| Instance types | CPU-optimized (ml.m5, ml.m6g) | GPU-required (ml.g5, ml.g6) |
| Inference latency | Milliseconds | Seconds |

Predictive Models

Predictive models are small models for classification, regression, and similar tasks. MCC requires a model format selection so the model loader knows which file to look for at /opt/ml/model/.
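
To make the format choice concrete, here is a minimal sketch of the loading step for the scikit-learn/joblib case. The function and file names are illustrative assumptions, not MCC's actual generated code:

# Illustrative loader sketch -- names are assumptions, not MCC's generated code.
import os
import joblib

MODEL_DIR = "/opt/ml/model"

def load_model():
    # The image bundles the artifact under /opt/ml/model/ at build time;
    # the selected format tells the loader which file to expect.
    return joblib.load(os.path.join(MODEL_DIR, "model.joblib"))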

Supported frameworks

| Framework | Model Formats | Generated Version | Use Case |
| --- | --- | --- | --- |
| scikit-learn | pkl, joblib | scikit-learn==1.7.1, joblib==1.4.2 | Classification, Regression, Clustering, Dimensionality Reduction, Data Preprocessing, Model Selection |
| XGBoost | json, model, ubj | xgboost==2.1.3 | Prediction, Classification |
| TensorFlow | keras, h5, SavedModel | tensorflow==2.20.0, setuptools>=65.0.0 | Prediction, Classification, Deep Learning, Neural Networks, NLP |

Loading models into the container

Local copy is the default strategy. A model file from your local filesystem is copied into the container at build time using a Dockerfile COPY directive. When bringing your own model, uncomment and edit this line in the generated Dockerfile:

# COPY your_model_files /opt/ml/model/

The target directory must be /opt/ml/model/ for SageMaker compatibility.

Sample model. For testing without a real model, MCC can train a sample model on the Abalone dataset using the selected framework. The sample model is automatically copied into the container. This is for validating the build and deployment pipeline, not for production use.

Generative Models

Generative models (LLMs) are specified by HuggingFace model ID at generation time. MCC does not require a model format -- the serving framework handles downloading the model and loading it into GPU memory. A typical configuration:

{
  "framework": "transformers",
  "modelName": "mistralai/Mistral-7B-Instruct-v0.2",
  "modelServer": "sglang"
}

HuggingFace authentication

Some models are gated and require a HuggingFace API token. You can provide it via:

  • CLI flag: --hf-token="hf_..." or --hf-token='$HF_TOKEN'
  • Environment variable: export HF_TOKEN="hf_..." then reference $HF_TOKEN
  • Interactive prompt (when selecting a custom model)
  • Config file: "hfToken": "$HF_TOKEN"

See the Configuration Guide for details and security best practices.

HuggingFace API lookups

When a model ID is specified, MCC validates the model by querying HuggingFace endpoints:

| Endpoint | Purpose |
| --- | --- |
| GET /api/models/{modelId} | Validate model exists, get metadata |
| GET /{modelId}/resolve/main/tokenizer_config.json | Extract chat template |
| GET /{modelId}/resolve/main/config.json | Get model architecture details |

These calls time out after 5 seconds and handle 404/429 errors gracefully. Use --offline to skip them entirely.
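
As a sketch of how such a lookup behaves (illustrative, not MCC's actual code), the validation call amounts to an HTTP GET with a short timeout that treats 404 and 429 as non-fatal:

# Illustrative sketch of the model-validation lookup; not MCC's actual code.
import requests

def validate_model(model_id, token=None):
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    try:
        resp = requests.get(
            f"https://huggingface.co/api/models/{model_id}",
            headers=headers,
            timeout=5,  # matches the 5-second timeout described above
        )
    except requests.RequestException:
        return None  # unreachable or timed out: skip validation
    if resp.status_code in (404, 429):
        return None  # not found or rate-limited: degrade gracefully
    resp.raise_for_status()
    return resp.json()  # model metadata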

Serving Frameworks

MCC generates different container architectures depending on the serving framework.

HTTP servers (predictive models)

Predictive models need an HTTP layer to expose the SageMaker-required /ping and /invocations endpoints on port 8080. MCC generates a model handler (model_handler.py) for inference logic and pairs it with a web server and an Nginx reverse proxy.

| Web Server | Description |
| --- | --- |
| Flask | Lightweight Python web framework, served via Gunicorn |
| FastAPI | Modern async framework with automatic OpenAPI docs |
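
The serving contract itself is small. Here is a minimal Flask sketch of the two required routes (illustrative only; the generated model_handler.py and server wiring are more complete, and the payload shape here is an assumption):

# Minimal sketch of the SageMaker serving contract; illustrative only.
import joblib
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("/opt/ml/model/model.joblib")  # format-dependent

@app.get("/ping")
def ping():
    # SageMaker health check: any 200 response means "container healthy".
    return "", 200

@app.post("/invocations")
def invocations():
    payload = request.get_json(force=True)  # payload shape is an assumption
    predictions = model.predict(payload["instances"])
    return jsonify({"predictions": predictions.tolist()})

In the generated project, Gunicorn runs the app behind the Nginx reverse proxy, which listens on port 8080.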

LLM servers (generative models)

LLM serving frameworks handle both model loading and HTTP serving. Some require an Nginx reverse proxy for SageMaker compatibility.

| Framework | Version | Base Image |
| --- | --- | --- |
| vLLM | N/A | vllm/vllm-openai:v0.10.1 |
| SGLang | 0.5.4.post1 | lmsysorg/sglang:v0.5.4.post1 |
| TensorRT-LLM | N/A | nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc8 |
| LMI | N/A | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu126 |
| DJL | N/A | deepjavalibrary/djl-serving:0.32.0-pytorch-cu126 |

Triton Inference Server

For multi-framework, high-throughput serving, MCC generates Triton-compatible projects with model repository layouts and config.pbtxt files. Supported backends: FIL, ONNX Runtime, TensorFlow, PyTorch, vLLM, TensorRT-LLM, and Python.
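
For orientation, Triton expects its standard model repository layout, which the generated projects follow. The model and artifact names below are placeholders:

model_repository/
└── my_model/          # one directory per model
    ├── config.pbtxt   # backend, input/output tensors, batching
    └── 1/             # numeric version directory
        └── model.onnx # artifact name depends on the backend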

Container Building

MCC generates Docker containers that package the base image, application code, configuration files, and dependencies. For predictive models, model artifacts are included in the image. For generative models, the container downloads models from HuggingFace Hub at runtime.

Local builds

Run ./do/build to create the image locally and ./do/push to upload it to Amazon ECR. You can test locally with ./do/run before pushing.

Locally built containers may fail with "exec format error" when deployed onto a different CPU architecture (e.g., built on ARM, deployed on x86). Use Docker's --platform flag to target the deployment architecture, or use CodeBuild for production builds.

AWS CodeBuild

For CI/CD workflows, ./do/submit creates a CodeBuild project that builds the image and pushes it to ECR in a single step. MCC generates the IAM policy document and buildspec.yml automatically. This is the preferred method for production containers, especially for large LLM images where fast network access to base image registries matters.

Endpoint Deployment

Once a container is built and pushed to ECR, ./do/deploy provisions the deployment target. MCC supports two targets:

  • Managed Inference (managed-inference): SageMaker real-time endpoints via the Inference Components API. This is the default.
  • HyperPod EKS (hyperpod-eks): Kubernetes deployment on existing SageMaker HyperPod clusters.

After deployment, ./do/test validates the endpoint, ./do/logs tails logs, and ./do/clean tears down resources.
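
You can also call the endpoint directly, outside the generated scripts. A minimal boto3 sketch (the endpoint name and payload are placeholders):

# Sketch: invoke a deployed endpoint directly with boto3; names are placeholders.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="my-mcc-endpoint",  # placeholder: use your endpoint's name
    ContentType="application/json",
    Body=json.dumps({"instances": [[5.1, 3.5, 1.4, 0.2]]}),
)
print(json.loads(response["Body"].read()))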

See Deployment & Inference for the full lifecycle script reference and target-specific details.

Architecture Overview

The following diagram shows the end-to-end flow from model source through deployment. Items marked with * are on the roadmap.

flowchart TB
    subgraph sources["Model Sources"]
        local["Local Models<br/>.pkl, .joblib, .h5"]
        s3["S3 Buckets*"]
        hf["HuggingFace Hub<br/>LLMs, chat models"]
    end

    subgraph predict["Predictive ML"]
        pfw["Framework<br/>scikit-learn, XGBoost, TensorFlow"]
        handler["model_handler.py"]
        http["HTTP Server<br/>Flask / FastAPI + Nginx"]
    end

    subgraph genai["Generative AI"]
        gfw["Serving Framework<br/>vLLM, SGLang, TensorRT-LLM, LMI, DJL"]
    end

    subgraph container["Container Build"]
        dockerfile["Dockerfile Generation"]
        base["Base Image"]
        deps["Dependencies"]
        code["Application Code"]
        build["Docker Build<br/>local or CodeBuild"]
        ecr["Push to ECR"]
    end

    subgraph deploy["Deployment"]
        endpoint["SageMaker Endpoint<br/>port 8080<br/>GET /ping<br/>POST /invocations"]
        clients["Client Applications"]
    end

    sources --> predict & genai
    pfw --> handler --> http
    http --> dockerfile
    gfw --> dockerfile
    dockerfile --> base & deps & code
    base & deps & code --> build --> ecr --> endpoint --> clients