Supported Models¶
MCC validates specific model + server + instance combinations end-to-end through the full lifecycle — generate, build, deploy, test, tune, and adapt. If your configuration is listed here, every step has been tested and proven to work.
Models not listed below are still supported by MCC — the CLI generates projects for any HuggingFace model. You take on the validation responsibility for unlisted combinations.
Model Families¶
Qwen 3¶
| Model | Parameters | Instance | Tuning Techniques | Status |
|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.6B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen3-1.7B | 1.7B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen3-4B | 4B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen3-8B | 8B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen3-14B | 14B | ml.g5.2xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen3-32B | 32B | ml.g5.12xlarge | SFT, DPO, RLVR | ✅ Validated |
Qwen 2.5¶
| Model | Parameters | Instance | Tuning Techniques | Status |
|---|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 7B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen2.5-14B-Instruct | 14B | ml.g5.2xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen2.5-32B-Instruct | 32B | ml.g5.12xlarge | SFT, DPO | ✅ Validated |
| Qwen/Qwen2.5-72B-Instruct | 72B | ml.g5.48xlarge | SFT, DPO, RLVR | ✅ Validated |
Llama 3¶
| Model | Parameters | Instance | Tuning Techniques | Status |
|---|---|---|---|---|
| meta-llama/Llama-3.2-1B-Instruct | 1B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| meta-llama/Llama-3.2-3B-Instruct | 3B | ml.g5.xlarge | SFT, DPO | ✅ Validated |
| meta-llama/Llama-3.1-8B-Instruct | 8B | ml.g5.xlarge | SFT, DPO, RLVR | ✅ Validated |
| meta-llama/Llama-3.3-70B-Instruct | 70B | ml.g5.48xlarge | SFT, DPO, RLVR, RLAIF | ✅ Validated |
Gated Models
Llama models require a HuggingFace token with Meta's license agreement accepted. See Secrets Management for configuration.
DeepSeek R1 (Distilled)¶
| Model | Parameters | Instance | Tuning Techniques | Status |
|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ml.g5.xlarge | SFT | ✅ Validated |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7B | ml.g5.xlarge | SFT | ✅ Validated |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | ml.g5.2xlarge | SFT | ✅ Validated |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | ml.g5.12xlarge | SFT | ✅ Validated |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B | ml.g5.xlarge | SFT | ✅ Validated |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 70B | ml.g5.48xlarge | SFT | ✅ Validated |
GPT-OSS¶
| Model | Parameters | Instance | Tuning Techniques | Status |
|---|---|---|---|---|
| openai/gpt-oss-20b | 20B | ml.g5.12xlarge | SFT, DPO | ✅ Validated |
| openai/gpt-oss-120b | 120B | ml.g5.48xlarge | SFT, DPO | ✅ Validated |
Inference Engines¶
vLLM — Fully Validated¶
All 22 models above are validated with vLLM as the inference engine. vLLM provides:
- High-throughput serving with PagedAttention
- OpenAI-compatible API (
/v1/chat/completions,/v1/completions) - LoRA adapter hot-swap (no endpoint restart required)
- Tensor parallelism for multi-GPU deployments
ml-container-creator my-model \
--deployment-config=transformers-vllm \
--model-name=Qwen/Qwen3-8B \
--instance-type=ml.g5.xlarge \
--enable-lora \
--skip-prompts
SGLang — Supported, Validation Pending¶
SGLang is supported by the generator but not yet included in the automated validation matrix. Generated projects work, but you may encounter edge cases not covered by testing.
TensorRT-LLM — Supported, Validation Pending¶
TensorRT-LLM is supported by the generator. Requires NGC authentication for base images. Not yet in the automated validation matrix.
DJL/LMI — Supported, Validation Pending¶
AWS Large Model Inference is supported by the generator. Not yet in the automated validation matrix.
Instance Recommendations¶
| Model Size | Recommended Instance | GPUs | VRAM | Tensor Parallelism |
|---|---|---|---|---|
| ≤8B (bf16) | ml.g5.xlarge | 1 | 24 GB | 1 |
| 8B–14B (bf16) | ml.g5.2xlarge | 1 | 24 GB | 1 |
| 14B–32B (bf16) | ml.g5.12xlarge | 4 | 96 GB | 4 |
| 32B–72B (bf16) | ml.g5.48xlarge | 8 | 192 GB | 8 |
| 70B+ (AWQ 4-bit) | ml.g5.12xlarge | 4 | 96 GB | 4 |
Quota Requirements
ml.g5.12xlarge and ml.g5.48xlarge often have a default quota of 0 for SageMaker AI endpoints. Request a quota increase via the Service Quotas console before deploying large models.
Tuning Techniques¶
All supported models include at least one fine-tuning technique via SageMaker AI Managed Model Customization (serverless — no instance management required).
| Technique | Description | Output | Deploy With |
|---|---|---|---|
| SFT | Supervised Fine-Tuning on prompt/completion pairs | LoRA adapter | do/adapter add --from-tune |
| DPO | Direct Preference Optimization on chosen/rejected pairs | LoRA adapter | do/adapter add --from-tune |
| RLVR | Reinforcement Learning with Verifiable Rewards | LoRA adapter | do/adapter add --from-tune |
| RLAIF | RL from AI Feedback (requires Lambda reward function) | LoRA adapter | do/adapter add --from-tune |
See Fine-Tuning for full documentation on do/tune and the tune-adapter-deploy feedback loop.
Lifecycle Coverage¶
Every model listed on this page has been validated through the complete lifecycle:
| Stage | What's Tested |
|---|---|
| Generate | CLI produces a valid project with Dockerfile, serve code, and do/ scripts |
| Build | Docker image builds successfully with correct base image and dependencies |
| Push | Image pushes to ECR without authentication or size issues |
| Deploy | Endpoint reaches InService within the expected timeout |
| Test | Health check passes + inference returns valid response |
| Tune | SageMaker AI managed customization job completes, adapter weights written to S3 |
| Adapter | LoRA adapter hot-swapped onto running endpoint without restart |
| Test (Adapter) | Inference with adapter returns valid response |
| Clean | All resources torn down, no orphaned endpoints |
Model Notes & Known Issues¶
General¶
- First deploy is slow — Model weights download from HuggingFace on first container start (5–30 min for large models). Subsequent deploys from the same ECR image are faster.
- Tensor parallelism must match GPU count — Set automatically by MCC based on instance type. Manual override via
do/config.
Qwen 3¶
- The 32B model requires
--max-model-lenadjustment for sequences > 8K tokens on g5.12xlarge (VRAM constraint).
Llama 3¶
- Gated model — requires HuggingFace token with Meta license acceptance.
- The 70B model takes ~15 min to reach InService on g5.48xlarge (weight download + model loading).
DeepSeek R1¶
- SFT-only (DPO and RLVR not supported by SageMaker AI managed customization for this family).
- Distilled variants maintain reasoning chain formatting in outputs.
GPT-OSS¶
- Newest family — less community validation than Qwen or Llama.
- The 120B model requires
ml.g5.48xlargequota (default 0 in most accounts).
Using Unsupported Models¶
MCC generates projects for any HuggingFace model, not just those listed here:
ml-container-creator my-custom-model \
--deployment-config=transformers-vllm \
--model-name=my-org/my-custom-model \
--instance-type=ml.g5.xlarge \
--skip-prompts
For unlisted models:
- ✅ Generation, build, push, and deploy will work if the model is compatible with the selected server
- ⚠️ Instance sizing is based on the MCP instance-sizer heuristic — verify manually for custom models
- ⚠️ Fine-tuning with
do/tuneonly works for models in the Supported Model Catalog - ⚠️ You take on validation responsibility — consider running
do/validatebefore deploying
Expansion Roadmap¶
These configurations are planned for future validation:
| Configuration | Status | Expected |
|---|---|---|
| SGLang (same 22 models) | Planned | Post-v1 |
| g6 instance family (NVIDIA L4) | Planned | Post-v1 |
| TensorRT-LLM (subset of models) | Planned | Post-v1 |
| DPO/RLVR full technique coverage | Planned | Post-v1 |
| Diffusion models (FLUX, Stable Diffusion) | Temporarily removed | TBD |