Troubleshooting Guide

Common issues and solutions for ML Container Creator.

Quick Reference - Most Common Issues

  • SyntaxError: Unexpected token 'export' → nvm use node
  • yo: command not found → npm install -g yo@latest
  • Test script hanging → Run with --verbose, check directory isolation
  • ESLint errors → npm run lint -- --fix
  • npm audit failing CI → Only fail on critical: npm audit --audit-level critical
  • Property tests failing → Run from project root directory
  • Generator not found → npm link in project directory
  • Accelerator type mismatch → Use recommended instance type from error message
  • Registry file not found → git checkout generators/app/config/registries/
  • HuggingFace API timeout → Use --offline flag or wait for fallback
  • Environment variable validation error → Check type/range or use --validate-env-vars=false
  • Schema validation failed → Ensure all required fields are present in registry

Generator Issues

Generator Not Found

Symptom:

$ yo ml-container-creator
Error: ml-container-creator generator not found

Solution:

# Link the generator
cd ml-container-creator
npm link

# Verify it's linked
npm list -g generator-ml-container-creator

# If still not working, reinstall Yeoman
npm install -g yo

Node Version Error

Symptom:

Error: The engine "node" is incompatible with this module
# OR
SyntaxError: Unexpected token 'export'
# OR  
SyntaxError: Cannot use import statement outside a module

Solution:

# Check Node version
node --version

# Must be 24.11.1 or higher for ES6 module support
# Install correct version using nvm
nvm install node  # Gets latest stable
nvm use node

# Or use mise
mise install

# Pro tip: ES6 import/export errors like the ones above usually indicate
# a Node.js version incompatibility

Template Variables Not Replaced

Symptom: Generated files contain <%= projectName %> instead of actual values.

Solution:

  • Check that templates use the .ejs extension or are in the templates directory
  • Verify copyTpl is used, not copy
  • Check for EJS syntax errors in templates

Yeoman Generator Not Working in CI

Symptom:

yo: command not found
# OR
generator-ml-container-creator not found

Solution:

# Install Yeoman globally in CI
npm install -g yo@latest

# Link the generator
npm link

# Verify installation
yo --version
yo --generators


Development and Testing Issues

ESLint Errors After Changes

Symptom:

 9552 problems (9552 errors, 0 warnings)

Solution:

# Auto-fix formatting issues
npm run lint -- --fix

# Check what files are being linted
npx eslint --debug generators/

# Update .eslintrc.js ignore patterns for generated files
"ignorePatterns": [
    "site/**",
    "drafts/**", 
    "test-output-*/**"
]

# For unused variables, either use them or prefix with underscore
const _unusedVar = something;  // Indicates intentionally unused

Test Script Hanging

Symptom: Test scripts hang indefinitely on certain steps

Solution:

# Run with verbose output to see where it hangs
./scripts/test-generate-projects.sh --verbose

# Common causes:
# 1. Directory conflicts - each test should run in isolated directory
# 2. Property tests running from wrong directory - need project root
# 3. Environment variables not cleaned up between tests

# Fix: Ensure each test creates its own subdirectory
mkdir -p "test-$test_name"
cd "test-$test_name"
# ... run test ...
cd ..

Property-Based Tests Failing

Symptom:

npm run test:property
Error: Cannot find module './test/property-tests.js'

Solution:

# Property tests must run from project root directory
cd /path/to/project/root
npm run test:property

# In scripts, store and use project root:
PROJECT_ROOT="$(pwd)"
cd "$PROJECT_ROOT"
npm run test:property

Test Output Directory Conflicts

Symptom: Tests fail because previous test files interfere with new tests

Solution:

# Use timestamped directories
TEST_OUTPUT_DIR="./test-output-$(date +%Y%m%d-%H%M%S)"

# Clean up between test runs
rm -rf test-output-*

# Or keep output for debugging
KEEP_TEST_OUTPUT=true ./scripts/test-generate-projects.sh


CI/CD Issues

npm Security Audit Failures

Symptom:

npm audit
found 11 vulnerabilities (8 moderate, 3 high)

Solution:

# Option 1: Use npm overrides in package.json
"overrides": {
  "semver": "^7.6.3",
  "path-to-regexp": "^8.0.0"
}

# Option 2: Only fail on critical vulnerabilities
npm audit --audit-level critical

# Option 3: Update dependencies
npm update
npm audit fix

npm Cache Issues in CI

Symptom:

Error: EACCES: permission denied, mkdir '/github/home/.npm'

Solution:

# Remove npm cache configuration if package-lock.json is gitignored
# Don't use: npm ci --cache .npm

# Use standard npm install instead
npm install

# Or configure cache properly
- name: Cache node modules
  uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

CI Failing on Moderate Vulnerabilities

Symptom: CI fails even though vulnerabilities are not critical

Solution:

# In .github/workflows/ci.yml
# Only fail on critical vulnerabilities
- name: Security audit
  run: |
    if ! npm audit --audit-level critical; then
      echo "Critical vulnerabilities found!"
      exit 1
    fi

    # Show all vulnerabilities but don't fail
    npm audit || true

Yeoman Installation in CI

Symptom:

yo: command not found in CI

Solution:

# Install Yeoman before testing
- name: Install Yeoman
  run: npm install -g yo@latest

- name: Link generator
  run: npm link

- name: Test generator
  run: yo ml-container-creator --help

ESLint Failing in CI

Symptom: CI fails on linting errors that pass locally

Solution:

# Ensure consistent Node.js version
# Use same version in CI as locally

# Update .eslintrc.js with proper ignore patterns
"ignorePatterns": [
    "node_modules/**",
    "site/**",
    "drafts/**",
    "test-output-*/**"
]

# Run lint with --fix in CI if needed
npm run lint -- --fix


Docker Build Issues

Build Fails: Package Not Found

Symptom:

ERROR: Could not find a version that satisfies the requirement scikit-learn

Solution:

# Update requirements.txt with specific versions
scikit-learn==1.3.0
numpy==1.24.0

# Or rebuild without using the Docker build cache
docker build --no-cache -t my-model .

Build Fails: Permission Denied

Symptom:

ERROR: failed to solve: failed to copy files: failed to copy: permission denied

Solution:

# Check file permissions
ls -la code/

# Fix permissions
chmod 644 code/*
chmod 755 code/*.sh

# Rebuild
docker build -t my-model .

Build Fails: Model File Too Large

Symptom:

ERROR: failed to copy: file too large

Solution:

# Option 1: Use .dockerignore
echo "*.pyc" >> .dockerignore
echo "__pycache__" >> .dockerignore
echo "*.log" >> .dockerignore

# Option 2: Use multi-stage build
# Edit Dockerfile to copy only necessary files

# Option 3: Download model at runtime from S3
# (Recommended for large transformer models)
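
If you go with Option 3, the runtime download is a few lines of boto3 executed before the server starts. This is a minimal sketch, assuming the artifact lives at an S3 location you control; the bucket, key, and filename are placeholders, and the container's execution role needs s3:GetObject on that path.

# download_model.py - illustrative sketch; bucket, key, and filename are placeholders
import os
import boto3

MODEL_DIR = "/opt/ml/model"

def download_model(bucket, key, filename="model.pkl"):
    """Fetch the model artifact from S3 into the SageMaker model directory."""
    os.makedirs(MODEL_DIR, exist_ok=True)
    local_path = os.path.join(MODEL_DIR, filename)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path

if __name__ == "__main__":
    download_model("my-bucket", "models/model.pkl")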


Local Testing Issues

Container Won't Start

Symptom:

$ docker run -p 8080:8080 my-model
Container exits immediately

Solution:

# Check container logs
docker logs <container-id>

# Run interactively to debug
docker run -it my-model /bin/bash

# Check if model file exists
ls -la /opt/ml/model/

# Check Python imports
python -c "import flask; import sklearn"

Port Already in Use

Symptom:

Error: bind: address already in use

Solution:

# Find process using port 8080
lsof -i :8080

# Kill the process
kill -9 <PID>

# Or use different port
docker run -p 8081:8080 my-model

Health Check Fails

Symptom:

$ curl http://localhost:8080/ping
curl: (7) Failed to connect to localhost port 8080

Solution:

# Check if container is running
docker ps

# Check container logs
docker logs <container-id>

# Verify server is listening
docker exec <container-id> netstat -tlnp | grep 8080

# Check firewall rules
# macOS: System Preferences > Security & Privacy > Firewall
# Linux: sudo ufw status
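
For reference, the health check only requires that /ping answer with HTTP 200 on port 8080 within a few seconds. The following is a minimal stand-in, not the generated serve.py; running it inside the container can help confirm whether the problem is the server or the port mapping.

# Minimal /ping sketch (debugging stand-in, not the generated serve.py)
from flask import Flask

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker marks the container healthy on any 200 response
    return "", 200

if __name__ == "__main__":
    # SageMaker sends traffic to port 8080 inside the container
    app.run(host="0.0.0.0", port=8080)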

Inference Returns Error

Symptom:

$ curl -X POST http://localhost:8080/invocations -d '{"instances": [[1,2,3]]}'
{"error": "Model prediction failed"}

Solution:

# Check container logs for detailed error
docker logs <container-id>

# Common issues:
# 1. Wrong input format
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0]]}'

# 2. Missing features
# Ensure input has same number of features as training data

# 3. Wrong data type
# Convert strings to numbers, handle missing values

# Test model handler directly
docker exec -it <container-id> python
>>> from code.model_handler import ModelHandler
>>> handler = ModelHandler('/opt/ml/model')
>>> handler.predict([[1.0, 2.0, 3.0]])


AWS Deployment Issues

ECR Repository Not Found

Symptom:

Error: Repository does not exist

Solution:

# Create ECR repository
aws ecr create-repository --repository-name my-model

# Or let build_and_push.sh create it
./deploy/build_and_push.sh

Authentication Failed

Symptom:

Error: no basic auth credentials

Solution:

# Login to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  <account-id>.dkr.ecr.us-east-1.amazonaws.com

# Verify AWS credentials
aws sts get-caller-identity

# Check AWS CLI configuration
aws configure list

IAM Permission Denied

Symptom:

Error: User is not authorized to perform: ecr:CreateRepository

Solution:

# Check your IAM permissions
aws iam get-user

# Required permissions:
# - ecr:CreateRepository
# - ecr:PutImage
# - ecr:InitiateLayerUpload
# - ecr:UploadLayerPart
# - ecr:CompleteLayerUpload
# - sagemaker:CreateModel
# - sagemaker:CreateEndpointConfig
# - sagemaker:CreateEndpoint
# - iam:PassRole

# Contact your AWS administrator to add permissions

Image Push Timeout

Symptom:

Error: timeout while pushing image

Solution:

# Check internet connection
ping aws.amazon.com

# Increase Docker timeout
export DOCKER_CLIENT_TIMEOUT=300
export COMPOSE_HTTP_TIMEOUT=300

# Use faster network or retry
./deploy/build_and_push.sh


SageMaker Endpoint Issues

Endpoint Creation Failed

Symptom:

Error: Failed to create endpoint
Status: Failed

Solution:

# Check CloudWatch logs
aws logs tail /aws/sagemaker/Endpoints/my-model --follow

# Common issues:
# 1. Invalid IAM role
aws iam get-role --role-name SageMakerExecutionRole

# 2. Image not found in ECR
aws ecr describe-images --repository-name my-model

# 3. Insufficient capacity
# Try different instance type or region

# 4. Model artifacts not accessible
# Check S3 permissions in IAM role

Endpoint Stuck in Creating

Symptom:

$ aws sagemaker describe-endpoint --endpoint-name my-model
Status: Creating (for > 15 minutes)

Solution:

# Check CloudWatch logs for errors
aws logs tail /aws/sagemaker/Endpoints/my-model --follow

# Common causes:
# 1. Container fails to start
#    - Check Dockerfile CMD/ENTRYPOINT
#    - Verify port 8080 is exposed

# 2. Health check fails
#    - Ensure /ping endpoint returns 200
#    - Check response time < 2 seconds

# 3. Model loading fails
#    - Verify model file exists in container
#    - Check model format matches code

# Delete and recreate if stuck
aws sagemaker delete-endpoint --endpoint-name my-model
./deploy/deploy.sh <role-arn>
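
Instead of re-running describe-endpoint by hand, a short boto3 loop can watch the status and print the FailureReason once the endpoint settles. The endpoint name below is a placeholder.

# Poll endpoint status until it settles (endpoint name is a placeholder)
import time
import boto3

sagemaker = boto3.client("sagemaker")

while True:
    desc = sagemaker.describe_endpoint(EndpointName="my-model")
    status = desc["EndpointStatus"]
    print("Status:", status)
    if status in ("InService", "Failed"):
        # FailureReason is only present when creation failed
        print("FailureReason:", desc.get("FailureReason", "none"))
        break
    time.sleep(30)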

Endpoint Returns 500 Error

Symptom:

$ aws sagemaker-runtime invoke-endpoint --endpoint-name my-model ...
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation

Solution:

# Check CloudWatch logs
aws logs tail /aws/sagemaker/Endpoints/my-model --follow

# Common causes:
# 1. Exception in prediction code
#    - Add try/except blocks
#    - Log detailed errors

# 2. Wrong input format
#    - Verify Content-Type header
#    - Check JSON structure

# 3. Model not loaded
#    - Check model loading in __init__
#    - Verify model path

# Test locally first
docker run -p 8080:8080 my-model
curl -X POST http://localhost:8080/invocations -d '...'
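
Once the container behaves locally, reproduce the remote call with boto3 so the exact response body and error message are visible. The endpoint name and payload below are placeholders; match them to your deployment.

# Reproduce the InvokeEndpoint call from Python (endpoint name and payload are placeholders)
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-model",
    ContentType="application/json",
    Body=json.dumps({"instances": [[1.0, 2.0, 3.0]]}),
)
print(response["Body"].read().decode("utf-8"))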

Endpoint Throttling

Symptom:

Error: Rate exceeded

Solution:

# Increase instance count
aws sagemaker update-endpoint \
  --endpoint-name my-model \
  --endpoint-config-name my-model-config-v2

# Or enable auto-scaling
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/my-model/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 \
  --max-capacity 5


Model Loading Issues

Model File Not Found

Symptom:

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pkl'

Solution:

# Verify model file is in code/ directory
ls -la code/

# Check Dockerfile COPY command
grep "COPY.*model" Dockerfile

# Verify model is in container
docker run my-model ls -la /opt/ml/model/

# For transformers, check S3 path
aws s3 ls s3://my-bucket/models/

Model Format Mismatch

Symptom:

ValueError: Model format not recognized

Solution:

# Verify model format matches code
# sklearn: .pkl or .joblib
# xgboost: .json, .model, or .ubj
# tensorflow: SavedModel directory or .h5

# Re-save model in correct format
import joblib
joblib.dump(model, 'model.pkl')

# Or update model_handler.py to match your format
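
If you want the handler to tolerate more than one artifact format, a small dispatch on file extension is one way to do it. This sketch covers pickle and joblib only; the extension list is an example, not the generator's actual loading logic.

# Pick a loader based on the artifact's extension (sketch; extend for your frameworks)
import os
import pickle
import joblib

def load_model(model_dir="/opt/ml/model"):
    for name in os.listdir(model_dir):
        path = os.path.join(model_dir, name)
        if name.endswith((".pkl", ".pickle")):
            with open(path, "rb") as f:
                return pickle.load(f)
        if name.endswith(".joblib"):
            return joblib.load(path)
    raise ValueError(f"No recognized model artifact in {model_dir}")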

Pickle Version Mismatch

Symptom:

ValueError: unsupported pickle protocol: 5

Solution:

# Save model with compatible protocol
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f, protocol=4)

# Or update Python version in Dockerfile
FROM python:3.9  # Use same version as training

Model Dependencies Missing

Symptom:

ModuleNotFoundError: No module named 'xgboost'

Solution:

# Add missing dependencies to requirements.txt
echo "xgboost==1.7.0" >> requirements.txt

# Rebuild container
docker build -t my-model .

# Verify dependencies
docker run my-model pip list


Performance Issues

Slow Inference

Symptom: Predictions take > 1 second for simple models

Solution:

# 1. Load model once at startup, not per request
class ModelHandler:
    def __init__(self):
        self.model = load_model()  # Load once

    def predict(self, data):
        return self.model.predict(data)  # Reuse

# 2. Use batch prediction
def predict(self, instances):
    # Process all instances at once
    return self.model.predict(instances)

# 3. Enable GPU acceleration
# Use gpu-enabled instance type

# 4. Optimize model
# - Use ONNX Runtime
# - Apply quantization
# - Prune unnecessary layers

# 5. Increase instance size
# ml.m5.xlarge -> ml.m5.2xlarge
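
Combining points 1 and 2 above, a load-once, batch-predicting handler can look roughly like this. It assumes a joblib-serialized scikit-learn model at /opt/ml/model/model.pkl; adapt the loading call to your framework.

# Load-once handler sketch (assumes a joblib-serialized sklearn model)
import os
import joblib

class ModelHandler:
    def __init__(self, model_dir="/opt/ml/model"):
        # Loaded exactly once when the server process starts
        self.model = joblib.load(os.path.join(model_dir, "model.pkl"))

    def predict(self, instances):
        # One vectorized call for the whole batch instead of a per-row loop
        return self.model.predict(instances).tolist()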

High Memory Usage

Symptom:

Container killed: Out of memory

Solution:

# 1. Use larger instance
# ml.m5.xlarge (16GB) -> ml.m5.2xlarge (32GB)

# 2. Optimize model loading
import gc
model = load_model()
gc.collect()  # Free memory

# 3. Use model quantization
# Reduce model size by 4x with minimal accuracy loss

# 4. Process in smaller batches
batch_size = 16  # Reduce from 32

# 5. Clear cache between predictions
import torch
torch.cuda.empty_cache()  # For PyTorch models
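
For point 4 above, processing in smaller batches can be as simple as chunking the input before calling predict. The batch size below is a starting point to tune against the instance's memory.

# Chunked prediction to cap peak memory (batch_size is a tuning knob)
def predict_in_batches(model, instances, batch_size=16):
    results = []
    for start in range(0, len(instances), batch_size):
        results.extend(model.predict(instances[start:start + batch_size]))
    return results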

Cold Start Latency

Symptom: First request takes 30+ seconds

Solution:

# 1. Warm up model at startup
class ModelHandler:
    def __init__(self):
        self.model = load_model()
        # Warm up with dummy prediction
        self.model.predict([[0] * num_features])

# 2. Use provisioned concurrency
# Keep instances warm

# 3. Optimize model loading
# - Use faster serialization (joblib vs pickle)
# - Load from local disk, not S3

# 4. Use smaller base image
FROM python:3.9-slim  # Instead of python:3.9

Concurrent Request Handling

Symptom: Endpoint can't handle multiple simultaneous requests

Solution:

# 1. Increase Gunicorn workers
# Edit start_server.py
gunicorn --workers 4 --threads 2 serve:app

# 2. Use async serving
# Switch to FastAPI with async endpoints

# 3. Enable auto-scaling
aws application-autoscaling put-scaling-policy \
  --policy-name my-scaling-policy \
  --service-namespace sagemaker \
  --resource-id endpoint/my-model/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-type TargetTrackingScaling

# 4. Use multiple instances
InitialInstanceCount=2  # In endpoint config
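
If you prefer a config file over command-line flags, Gunicorn also reads a Python config module; the values below mirror the flags in option 1 above and are a starting point, not a universal recommendation. Launch with gunicorn -c gunicorn.conf.py serve:app.

# gunicorn.conf.py - sketch mirroring the CLI flags above
workers = 4            # parallel worker processes
threads = 2            # threads per worker
bind = "0.0.0.0:8080"  # SageMaker sends traffic to port 8080
timeout = 120          # seconds before an unresponsive worker is restarted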


Configuration Registry Issues

Registry File Not Found

Symptom:

Warning: Failed to load framework registry, using defaults

Solution:

# Verify registry files exist
ls -la generators/app/config/registries/

# Should see:
# - frameworks.js
# - models.js
# - instance-accelerator-mapping.js

# If missing, restore from git
git checkout generators/app/config/registries/

# Or reinstall
npm install
npm link

Invalid Registry Schema

Symptom:

Error: Framework registry entry missing required field: baseImage

Solution:

// Check registry entry has all required fields
{
  "vllm": {
    "0.5.0": {
      "baseImage": "vllm/vllm-openai:v0.5.0",  // Required
      "accelerator": {                          // Required
        "type": "cuda",
        "version": "12.1"
      },
      "envVars": {},                            // Required (can be empty)
      "inferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",  // Required
      "recommendedInstanceTypes": ["ml.g5.xlarge"],  // Required
      "validationLevel": "tested"               // Required
    }
  }
}

// Run schema validation tests
npm test -- --grep "schema validation"

Configuration Not Applied

Symptom: Generated Dockerfile doesn't include expected environment variables from registry

Solution:

# 1. Verify framework and version match registry
cat generators/app/config/registries/frameworks.js | grep -A 20 "vllm"

# 2. Check configuration merge priority
# Priority: Framework base → Framework profile → HF API → Model registry → Model profile

# 3. Enable debug logging
DEBUG=* yo ml-container-creator

# 4. Check if graceful degradation is occurring
# If registry is empty, generator uses defaults

# 5. Verify registry is loaded
# Add console.log in configuration-manager.js
console.log('Loaded framework registry:', this.frameworkRegistry);

Profile Not Available

Symptom:

Warning: Profile 'low-latency' not found for vllm 0.5.0

Solution:

// Check if profile exists in registry
{
  "vllm": {
    "0.5.0": {
      // ... base config ...
      "profiles": {
        "low-latency": {  // Profile name must match
          "displayName": "Low Latency",
          "description": "Optimized for single-request latency",
          "envVars": {
            "VLLM_MAX_BATCH_SIZE": "1"
          }
        }
      }
    }
  }
}

// Or skip profile selection
yo ml-container-creator --skip-profile-selection


Accelerator Compatibility Issues

Accelerator Type Mismatch

Symptom:

Error: Framework requires cuda but instance provides neuron
Consider using ml.g5.xlarge, ml.g5.2xlarge, ml.g5.4xlarge

Solution:

# Option 1: Use recommended instance type
yo ml-container-creator \
  --framework=vllm \
  --instance-type=ml.g5.xlarge

# Option 2: Choose framework compatible with your instance
# For ml.inf2 (Neuron SDK), use transformers-neuron
yo ml-container-creator \
  --framework=transformers-neuron \
  --instance-type=ml.inf2.xlarge

# Option 3: Override validation (not recommended)
# Advanced users only - may result in deployment failures

Accelerator Version Mismatch

Symptom:

Warning: Framework requires CUDA 12.1, but instance only supports CUDA 11.8
Consider using ml.g5 or ml.g6 instances for CUDA 12.x support

Solution:

# Option 1: Use instance with correct CUDA version
# ml.g5 family: CUDA 12.x
# ml.g4dn family: CUDA 11.x
# ml.p3 family: CUDA 11.x

yo ml-container-creator \
  --framework=vllm \
  --version=0.5.0 \
  --instance-type=ml.g5.xlarge  # CUDA 12.x

# Option 2: Use older framework version
yo ml-container-creator \
  --framework=vllm \
  --version=0.3.0 \  # Requires CUDA 11.x
  --instance-type=ml.g4dn.xlarge

# Option 3: Check instance accelerator mapping
cat generators/app/config/registries/instance-accelerator-mapping.js

AMI Version Incompatibility

Symptom:

Error: InferenceAmiVersion al2-ami-sagemaker-inference-gpu-3-1 incompatible with instance

Solution:

# Check AMI version in framework registry
cat generators/app/config/registries/frameworks.js | grep inferenceAmiVersion

# Verify AMI provides required accelerator version
# GPU AMIs:
# - al2-ami-sagemaker-inference-gpu-3-1: CUDA 12.x
# - al2-ami-sagemaker-inference-gpu-2-0: CUDA 11.x

# Neuron AMIs:
# - al2-ami-sagemaker-inference-neuron-2-0: Neuron SDK 2.15+

# Update framework registry if needed
# See docs/REGISTRY_CONTRIBUTION_GUIDE.md

No Accelerator Data Available

Symptom:

Warning: No accelerator data for ml.custom.xlarge (best-effort validation)

Solution:

# This is informational - validation is best-effort

# Option 1: Add instance to mapping
# Edit generators/app/config/registries/instance-accelerator-mapping.js
{
  "ml.custom.xlarge": {
    "family": "custom",
    "accelerator": {
      "type": "cuda",
      "hardware": "NVIDIA A100",
      "architecture": "Ampere",
      "versions": ["12.1", "12.2"],
      "default": "12.2"
    },
    "memory": "32 GB",
    "vcpus": 8,
    "notes": "Custom instance type"
  }
}

# Option 2: Proceed with warning
# Generator will continue with default behavior

# Option 3: Contribute mapping
# See docs/REGISTRY_CONTRIBUTION_GUIDE.md

Custom Accelerator Type

Symptom:

Warning: No validator for accelerator type 'tpu' (best-effort validation)

Solution:

# Option 1: Create custom validator
# See docs/ACCELERATOR_VALIDATOR_GUIDE.md

# Option 2: Use existing accelerator type
# Supported types: cuda, neuron, cpu, rocm

# Option 3: Contribute validator
# 1. Create validator class extending AcceleratorValidator
# 2. Implement validate() and getVersionMismatchMessage()
# 3. Register in ValidationEngine
# 4. Add tests
# 5. Submit PR


HuggingFace API Issues

API Timeout

Symptom:

Warning: HuggingFace API timeout, checking local registry

Solution:

# This is expected behavior - generator falls back gracefully

# Option 1: Increase timeout (if slow network)
# Edit generators/app/lib/huggingface-client.js
this.timeout = 10000;  // 10 seconds instead of 5

# Option 2: Use offline mode
yo ml-container-creator --offline

# Option 3: Check network connectivity
curl -I https://huggingface.co

# Option 4: Use model registry instead
# Add model to generators/app/config/registries/models.js

Model Not Found

Symptom:

Warning: Model 'my-org/my-model' not found on HuggingFace, proceeding with defaults

Solution:

# This is expected for private or non-existent models

# Option 1: Verify model ID
# Check on https://huggingface.co/my-org/my-model

# Option 2: Add to model registry
# Edit generators/app/config/registries/models.js
{
  "my-org/my-model": {
    "family": "llama",
    "chatTemplate": "...",
    "requiresTemplate": true,
    "validationLevel": "experimental"
  }
}

# Option 3: Use offline mode
yo ml-container-creator --offline

# Option 4: Proceed without model-specific config
# Generator will use framework defaults

Rate Limit Exceeded

Symptom:

Warning: HuggingFace API rate limit exceeded, using cached data

Solution:

# Option 1: Wait and retry
# HuggingFace has rate limits for unauthenticated requests

# Option 2: Use HF_TOKEN for higher limits
export HF_TOKEN=hf_your_token_here
yo ml-container-creator

# Option 3: Use offline mode
yo ml-container-creator --offline

# Option 4: Use model registry
# Pre-configure models in local registry

Chat Template Not Found

Symptom:

Info: No chat template found for model, chat endpoints may not work

Solution:

# This is informational - not all models have chat templates

# Option 1: Add chat template to model registry
{
  "my-org/my-model": {
    "chatTemplate": "{% for message in messages %}...",
    "requiresTemplate": true
  }
}

# Option 2: Configure at runtime
# Set CHAT_TEMPLATE environment variable in deployment

# Option 3: Use model without chat
# Model will work for completion but not chat endpoints

# Option 4: Check HuggingFace model card
# Some models document chat template in README


Environment Variable Validation Issues

Unknown Environment Variable

Symptom:

Warning: Environment variable 'CUSTOM_FLAG' not found in known flags registry

Solution:

# This is a warning - generator will proceed

# Option 1: Add to known flags registry
# Edit generators/app/config/registries/framework-flags.js
{
  "vllm": {
    "0.5.0": {
      "CUSTOM_FLAG": {
        "type": "string",
        "description": "Custom configuration flag"
      }
    }
  }
}

# Option 2: Disable validation
yo ml-container-creator --validate-env-vars=false

# Option 3: Contribute flag definition
# See docs/REGISTRY_CONTRIBUTION_GUIDE.md

# Option 4: Proceed with warning
# Flag will still be included in generated files

Invalid Environment Variable Type

Symptom:

Error: Environment variable 'MAX_BATCH_SIZE' must be integer, got 'large'

Solution:

# Fix the value to match expected type

# Integer values
MAX_BATCH_SIZE=256

# Float values
GPU_MEMORY_UTILIZATION=0.9

# Boolean values
ENABLE_CACHING=true

# String values
MODEL_NAME="my-model"

# Check flag definition in registry
cat generators/app/config/registries/framework-flags.js

Environment Variable Out of Range

Symptom:

Warning: Environment variable 'GPU_MEMORY_UTILIZATION' value 1.5 exceeds maximum 1.0

Solution:

# Adjust value to be within valid range

# Check constraints in registry
{
  "GPU_MEMORY_UTILIZATION": {
    "type": "float",
    "min": 0.0,
    "max": 1.0,
    "description": "Fraction of GPU memory to use"
  }
}

# Use valid value
GPU_MEMORY_UTILIZATION=0.9

# Or disable validation
yo ml-container-creator --validate-env-vars=false

Deprecated Environment Variable

Symptom:

Warning: Environment variable 'OLD_FLAG' is deprecated, use 'NEW_FLAG' instead

Solution:

# Update to use new flag name

# Old (deprecated)
OLD_FLAG=value

# New (recommended)
NEW_FLAG=value

# Check deprecation info in registry
{
  "OLD_FLAG": {
    "deprecated": true,
    "replacement": "NEW_FLAG",
    "deprecatedSince": "0.5.0"
  }
}

Validation Disabled in Tests

Symptom:

# Tests pass locally but fail in CI
Error: Environment variable validation failed

Solution:

# Ensure VALIDATE_ENV_VARS=false in test environment

# In test files
process.env.VALIDATE_ENV_VARS = 'false';

# In CI configuration
env:
  VALIDATE_ENV_VARS: false

# In npm scripts
"test": "VALIDATE_ENV_VARS=false mocha"

# Verify in tests
console.log('VALIDATE_ENV_VARS:', process.env.VALIDATE_ENV_VARS);


Registry Schema Issues

Schema Validation Failed

Symptom:

Error: Framework registry validation failed: data.vllm.0.5.0 should have required property 'baseImage'

Solution:

// Ensure all required fields are present

// Framework Registry required fields:
{
  "baseImage": "string",
  "accelerator": {
    "type": "cuda|neuron|cpu|rocm",
    "version": "string|null"
  },
  "envVars": {},
  "inferenceAmiVersion": "string",
  "recommendedInstanceTypes": ["string"],
  "validationLevel": "tested|community-validated|experimental|unknown"
}

// Model Registry required fields:
{
  "family": "string",
  "chatTemplate": "string|null",
  "requiresTemplate": boolean,
  "validationLevel": "tested|community-validated|experimental",
  "frameworkCompatibility": {}
}

// Instance Accelerator Mapping required fields:
{
  "family": "string",
  "accelerator": {
    "type": "cuda|neuron|cpu|rocm",
    "hardware": "string",
    "architecture": "string",
    "versions": ["string"]|null,
    "default": "string|null"
  },
  "memory": "string",
  "vcpus": number
}

// Run schema validation
npm test -- --grep "schema"

Invalid Accelerator Type

Symptom:

Error: data.accelerator.type should be equal to one of the allowed values: cuda, neuron, cpu, rocm

Solution:

// Use valid accelerator type
{
  "accelerator": {
    "type": "cuda",  // Must be: cuda, neuron, cpu, or rocm
    "version": "12.1"
  }
}

// For new accelerator types:
// 1. Create custom validator (see docs/ACCELERATOR_VALIDATOR_GUIDE.md)
// 2. Update schema to include new type
// 3. Add to ValidationEngine

Invalid Validation Level

Symptom:

Error: data.validationLevel should be equal to one of the allowed values

Solution:

// Use valid validation level

// Framework Registry:
"validationLevel": "tested"  // or "community-validated", "experimental", "unknown"

// Model Registry:
"validationLevel": "tested"  // or "community-validated", "experimental"

// Validation level criteria:
// - tested: Deployed and validated on AWS
// - community-validated: Built and tested by community
// - experimental: Passes automated tests only
// - unknown: No validation data

Missing Required Field

Symptom:

Error: data should have required property 'recommendedInstanceTypes'

Solution:

// Add missing required field
{
  "vllm": {
    "0.5.0": {
      "baseImage": "vllm/vllm-openai:v0.5.0",
      "accelerator": { "type": "cuda", "version": "12.1" },
      "envVars": {},
      "inferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
      "recommendedInstanceTypes": ["ml.g5.xlarge"],  // Add this
      "validationLevel": "tested"
    }
  }
}

// Check schema definition
cat generators/app/config/schemas/framework-registry-schema.js

Pattern Matching Not Working

Symptom: Model pattern mistral* doesn't match mistralai/Mistral-7B-v0.1

Solution:

// Use correct pattern syntax

// Correct patterns:
"mistralai/Mistral-*"     // Matches mistralai/Mistral-7B-v0.1
"meta-llama/Llama-2-*"    // Matches meta-llama/Llama-2-7b-hf
"google/gemma-*"          // Matches google/gemma-2b

// Incorrect patterns:
"mistral*"                // Won't match mistralai/Mistral-7B-v0.1
"*Mistral*"               // Too broad, may match unintended models

// Pattern matching is case-sensitive
// Exact matches take precedence over patterns


Getting Help

Check Logs

# Container logs
docker logs <container-id>

# SageMaker endpoint logs
aws logs tail /aws/sagemaker/Endpoints/my-model --follow

# Build logs
docker build -t my-model . 2>&1 | tee build.log

Enable Debug Mode

# Add to serve.py
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Add detailed error handling
try:
    prediction = model.predict(data)
except Exception as e:
    logger.error(f"Prediction failed: {str(e)}", exc_info=True)
    raise

Test Components Individually

# Test model loading
docker run -it my-model python
>>> from code.model_handler import ModelHandler
>>> handler = ModelHandler('/opt/ml/model')

# Test Flask app
docker run -it my-model python code/serve.py

# Test inference
curl -X POST http://localhost:8080/invocations -d '...'

Community Support

AWS Support

  • SageMaker Documentation: https://docs.aws.amazon.com/sagemaker/
  • AWS Support: https://console.aws.amazon.com/support/
  • AWS Forums: https://forums.aws.amazon.com/forum.jspa?forumID=285

Prevention Tips

Before Generating

  • ✅ Verify Node.js version (24.11.1+) - use nvm use node for latest
  • ✅ Have model file ready
  • ✅ Know model format and framework
  • ✅ Have AWS credentials configured
  • ✅ Install Yeoman globally: npm install -g yo@latest

Before Building

  • ✅ Test model loading locally
  • ✅ Verify all dependencies in requirements.txt
  • ✅ Check model file size
  • ✅ Review Dockerfile
  • ✅ Run linting: npm run lint -- --fix

Before Deploying

  • ✅ Test container locally
  • ✅ Verify IAM role permissions
  • ✅ Check ECR repository exists
  • ✅ Confirm instance type availability
  • ✅ Run comprehensive tests: ./scripts/test-generate-projects.sh

Before Production

  • ✅ Load test endpoint
  • ✅ Set up monitoring and alarms
  • ✅ Configure auto-scaling
  • ✅ Document deployment process
  • ✅ Plan rollback strategy

Development Best Practices

  • ✅ Use nvm use node when encountering ES6 import errors
  • ✅ Run tests in isolated directories to avoid conflicts
  • ✅ Keep test output for debugging: KEEP_TEST_OUTPUT=true
  • ✅ Use verbose mode for troubleshooting: --verbose
  • ✅ Clean up environment variables between tests
  • ✅ Only fail CI on critical security vulnerabilities
  • ✅ Use npm overrides for dependency security patches