AWS SageMaker BYOC Context¶
SageMaker Container Requirements¶
Mandatory Endpoints¶
All SageMaker inference containers must implement:
- /ping (GET) - Health check endpoint
  - Returns 200 if the container is healthy
  - Called periodically by SageMaker
  - Should be lightweight and fast
- /invocations (POST) - Inference endpoint
  - Receives prediction requests
  - Returns predictions
  - Must handle the content type specified in the request
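A minimal sketch of these two endpoints, assuming a Flask-based server and a model loaded once at startup (the framework choice and payload shape are illustrative, not the generated code):

from flask import Flask, Response, jsonify, request

app = Flask(__name__)
model = None  # assumed to be loaded once at container startup

@app.route('/ping', methods=['GET'])
def ping():
    # Healthy (200) only once the model has been loaded
    return Response(status=200 if model is not None else 503)

@app.route('/invocations', methods=['POST'])
def invocations():
    payload = request.get_json()
    predictions = model.predict(payload['instances'])
    return jsonify({'predictions': predictions.tolist()})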
Container Specifications¶
- Port: Must listen on port 8080
- Model location: /opt/ml/model/ (SageMaker mounts model artifacts here)
- Timeout: Respond to /ping within 2 seconds
- Memory: Container should handle OOM gracefully
Environment Variables¶
SageMaker provides these environment variables:
SM_MODEL_DIR=/opt/ml/model # Model artifacts location
SM_NUM_GPUS=1 # Number of GPUs available
SM_NUM_CPUS=4 # Number of CPUs available
SM_LOG_LEVEL=INFO # Logging level
SM_NETWORK_INTERFACE_NAME=eth0 # Network interface
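A container can read these at startup; a small sketch (the device-selection logic is illustrative, not part of the SageMaker contract):

import os

model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
num_gpus = int(os.environ.get('SM_NUM_GPUS', '0'))

# Illustrative: choose a device string based on the GPUs SageMaker reports
device = 'cuda' if num_gpus > 0 else 'cpu'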
Deployment Architecture¶
Standard Flow¶
1. Build Docker Image
   └─> docker build -t my-model .

2. Push to ECR
   ├─> aws ecr create-repository
   ├─> docker tag my-model:latest <ecr-url>
   └─> docker push <ecr-url>

3. Create SageMaker Model
   └─> aws sagemaker create-model
       ├─> ModelName
       ├─> ExecutionRoleArn
       └─> PrimaryContainer
           ├─> Image (ECR URL)
           └─> ModelDataUrl (S3 path, optional)

4. Create Endpoint Configuration
   └─> aws sagemaker create-endpoint-config
       ├─> EndpointConfigName
       └─> ProductionVariants
           ├─> InstanceType
           ├─> InitialInstanceCount
           └─> ModelName

5. Create Endpoint
   └─> aws sagemaker create-endpoint
       ├─> EndpointName
       └─> EndpointConfigName

6. Wait for Endpoint
   └─> aws sagemaker describe-endpoint
       └─> Status: InService
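The same flow can be driven from Python with boto3; a minimal sketch, where the model name, endpoint name, role ARN, image URI, and S3 path are all placeholders:

import boto3

sm = boto3.client('sagemaker')

# 3. Create the model (points at the ECR image and optional S3 artifacts)
sm.create_model(
    ModelName='my-model',
    ExecutionRoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    PrimaryContainer={
        'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:latest',
        'ModelDataUrl': 's3://my-model-bucket/model/model.tar.gz',  # optional
    },
)

# 4. Create the endpoint configuration
sm.create_endpoint_config(
    EndpointConfigName='my-endpoint-config',
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'my-model',
        'InstanceType': 'ml.m6g.large',
        'InitialInstanceCount': 1,
    }],
)

# 5. Create the endpoint
sm.create_endpoint(
    EndpointName='my-endpoint',
    EndpointConfigName='my-endpoint-config',
)

# 6. Wait until the endpoint reports InService
sm.get_waiter('endpoint_in_service').wait(EndpointName='my-endpoint')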
Transformer Models Flow (Additional Step)¶
0. Upload Model to S3
   └─> aws s3 cp model/ s3://bucket/model/ --recursive
Then follow standard flow with ModelDataUrl pointing to S3
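A boto3 equivalent of the recursive copy in step 0; a minimal sketch, where the local directory, bucket, and key prefix are placeholders:

import os
import boto3

s3 = boto3.client('s3')
local_dir, bucket, prefix = 'model', 'bucket', 'model'

# Walk the local directory and upload each file under the same key prefix
for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.join(prefix, os.path.relpath(path, local_dir))
        s3.upload_file(path, bucket, key)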
IAM Permissions¶
Required SageMaker Execution Role Permissions¶
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-model-bucket/*",
        "arn:aws:s3:::my-model-bucket"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
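Creating a role with this policy can also be scripted; a minimal boto3 sketch, assuming the role and policy names are placeholders and the JSON above is saved as execution-policy.json:

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets SageMaker assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sagemaker.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName='SageMakerExecutionRole',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the permissions shown above as an inline policy
with open('execution-policy.json') as f:
    iam.put_role_policy(
        RoleName='SageMakerExecutionRole',
        PolicyName='SageMakerExecutionPolicy',
        PolicyDocument=f.read(),
    )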
Required User/CI Permissions¶
{
  "Effect": "Allow",
  "Action": [
    "ecr:CreateRepository",
    "ecr:PutImage",
    "ecr:InitiateLayerUpload",
    "ecr:UploadLayerPart",
    "ecr:CompleteLayerUpload",
    "sagemaker:CreateModel",
    "sagemaker:CreateEndpointConfig",
    "sagemaker:CreateEndpoint",
    "sagemaker:DescribeEndpoint",
    "sagemaker:DeleteEndpoint",
    "sagemaker:DeleteEndpointConfig",
    "sagemaker:DeleteModel",
    "iam:PassRole"
  ],
  "Resource": "*"
}
Instance Types¶
ML Container Creator provides three instance type options:
Predefined Instance Types¶
CPU-Optimized Instances (Default: ml.m6g.large)¶
- ml.m5.xlarge - 4 vCPU, 16 GB RAM - Good for small models
- ml.m5.2xlarge - 8 vCPU, 32 GB RAM - Medium models
- ml.m5.4xlarge - 16 vCPU, 64 GB RAM - Large models
- ml.c5.xlarge - 4 vCPU, 8 GB RAM - Compute-intensive
- ml.m6g.large - 2 vCPU, 8 GB RAM - Default for traditional ML - ARM-based, cost-effective
GPU-Enabled Instances¶
- ml.g4dn.xlarge - 1 GPU (16GB), 4 vCPU - Small LLMs
- ml.g4dn.2xlarge - 1 GPU (16GB), 8 vCPU - Medium LLMs
- ml.g5.xlarge - 1 GPU (24GB), 4 vCPU - Default for traditional ML with GPU - Larger models
- ml.g5.2xlarge - 1 GPU (24GB), 8 vCPU - Better performance
- ml.g6.12xlarge - 4 GPUs (96GB), 48 vCPU - Default for transformers - Large LLMs
- ml.p3.2xlarge - 1 GPU (16GB V100) - Training/large inference
- ml.p4d.24xlarge - 8 GPUs (40GB A100) - Very large models
Custom Instance Types¶
You can specify any AWS SageMaker instance type using the custom option:
# CLI usage
yo ml-container-creator --instance-type=custom --custom-instance-type=ml.g4dn.xlarge
# Configuration file
{
  "instanceType": "custom",
  "customInstanceType": "ml.inf1.xlarge"
}
# Environment variables
export ML_INSTANCE_TYPE=custom
export ML_CUSTOM_INSTANCE_TYPE=ml.g4dn.2xlarge
Popular Custom Instance Types¶
- ml.inf1.xlarge - AWS Inferentia chip - Optimized for inference
- ml.inf1.2xlarge - AWS Inferentia chip - Higher throughput
- ml.t3.medium - 2 vCPU, 4 GB RAM - Development/testing
- ml.c5n.xlarge - 4 vCPU, 10.5 GB RAM - Network-optimized
- ml.r5.large - 2 vCPU, 16 GB RAM - Memory-optimized
Instance Type Selection Guide¶
| Use Case | Recommended Instance Type | Rationale |
|---|---|---|
| Development/Testing | ml.t3.medium (custom) | Low cost, sufficient for testing |
| Small Traditional ML | cpu-optimized (ml.m6g.large) | Cost-effective, ARM-based |
| Large Traditional ML | ml.m5.xlarge (custom) | More memory and compute |
| Deep Learning Models | gpu-enabled (ml.g5.xlarge) | GPU acceleration |
| Large Language Models | gpu-enabled (ml.g6.12xlarge) | Multiple GPUs, high memory |
| Inference Optimization | ml.inf1.xlarge (custom) | AWS Inferentia chips |
| High Throughput | ml.c5n.xlarge (custom) | Network-optimized |
Default Mappings¶
The generator automatically maps abstract instance types to specific AWS instances:
# In deploy/deploy.sh template (pseudocode)
if instanceType === 'cpu-optimized':
    INSTANCE_TYPE="ml.m6g.large"
elif instanceType === 'gpu-enabled' && framework === 'transformers':
    INSTANCE_TYPE="ml.g6.12xlarge"
elif instanceType === 'gpu-enabled':
    INSTANCE_TYPE="ml.g5.xlarge"
elif instanceType === 'custom':
    INSTANCE_TYPE="${customInstanceType}"
Cost Considerations¶
- Start with smallest instance that fits your model
- Use CPU instances for traditional ML (sklearn, xgboost)
- Use GPU instances for deep learning and transformers
- Consider multi-model endpoints for cost optimization
Model Artifacts¶
Traditional ML Models¶
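A typical layout for traditional ML artifacts, assuming a single serialized model file (the model.pkl name matches the loading example below):

/opt/ml/model/
└── model.pkl            # Serialized model (e.g. pickle/joblib)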
Transformer Models¶
/opt/ml/model/
├── config.json # Model configuration
├── pytorch_model.bin # Model weights
├── tokenizer.json # Tokenizer
├── tokenizer_config.json
└── special_tokens_map.json
Model Loading Best Practices¶
import os
# Use environment variable
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
# Check if model exists
model_path = os.path.join(model_dir, 'model.pkl')
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model not found at {model_path}")

# Load model once at startup, not per request
# (load_model is a placeholder for a framework-specific loader, e.g. joblib.load)
model = load_model(model_path)
Request/Response Format¶
JSON Input (Traditional ML)¶
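A representative request payload, matching the shape used in the local and endpoint testing examples later on this page (the exact schema depends on the generated inference code):

{
  "instances": [
    [1.0, 2.0, 3.0]
  ]
}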
JSON Output (Traditional ML)¶
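A matching response; the predictions field name is an assumption consistent with the Flask handler shown further down, not a fixed SageMaker contract:

{
  "predictions": [0]
}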
Text Input (Transformers)¶
{
  "inputs": "What is the capital of France?",
  "parameters": {
    "max_new_tokens": 100,
    "temperature": 0.7
  }
}
Content-Type Handling¶
from flask import Flask, Response, jsonify, request

app = Flask(__name__)

@app.route('/invocations', methods=['POST'])
def predict():
    content_type = request.headers.get('Content-Type', 'application/json')
    if content_type == 'application/json':
        data = request.get_json()
    elif content_type == 'text/csv':
        data = parse_csv(request.data)  # parse_csv: see sketch below
    else:
        return Response(
            f"Unsupported content type: {content_type}",
            status=415
        )
    predictions = model.predict(data)  # model loaded once at startup
    return jsonify({'predictions': predictions.tolist()})
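The parse_csv helper referenced above is not defined in this snippet; a minimal sketch, assuming each line of the request body is one feature row with no header:

import csv
import io

def parse_csv(raw_bytes):
    # Decode the request body and parse each row into a list of floats
    reader = csv.reader(io.StringIO(raw_bytes.decode('utf-8')))
    return [[float(value) for value in row] for row in reader if row]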
Logging¶
CloudWatch Logs¶
- All stdout/stderr goes to CloudWatch Logs
- Log group: /aws/sagemaker/Endpoints/{endpoint-name}
- Use structured logging for better searchability
Logging Best Practices¶
import logging
import json
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Log structured data
logger.info(json.dumps({
    'event': 'prediction',
    'latency_ms': 45,
    'input_size': 1024,
    'status': 'success'
}))

# Log errors with context
try:
    prediction = model.predict(data)
except Exception as e:
    logger.error(f"Prediction failed: {str(e)}", exc_info=True)
    raise
Testing Endpoints¶
Local Testing¶
# Test health check
curl http://localhost:8080/ping
# Test inference
curl -X POST http://localhost:8080/invocations \
-H "Content-Type: application/json" \
-d '{"instances": [[1.0, 2.0, 3.0]]}'
SageMaker Endpoint Testing¶
# Using AWS CLI
aws sagemaker-runtime invoke-endpoint \
--endpoint-name my-endpoint \
--body '{"instances": [[1.0, 2.0, 3.0]]}' \
--content-type application/json \
output.json
# Using Python SDK
import boto3
import json
runtime = boto3.client('sagemaker-runtime')
response = runtime.invoke_endpoint(
    EndpointName='my-endpoint',
    ContentType='application/json',
    Body=json.dumps({'instances': [[1.0, 2.0, 3.0]]})
)
result = json.loads(response['Body'].read())
Common Issues & Solutions¶
Container Fails to Start¶
- Check CloudWatch logs for errors
- Verify model files exist in /opt/ml/model
- Ensure port 8080 is exposed and listening
- Check for Python import errors
Endpoint Creation Fails¶
- Verify IAM role has correct permissions
- Check ECR image exists and is accessible
- Ensure instance type is available in region
- Verify model data URL is correct (if using S3)
Slow Inference¶
- Model loading on every request (load once at startup)
- Large model size (consider model optimization)
- Insufficient instance resources (upgrade instance type)
- Network latency (use VPC endpoints)
Out of Memory¶
- Model too large for instance (upgrade instance type)
- Memory leak in inference code (profile and fix)
- Batch size too large (reduce batch size)
- Multiple models loaded (use multi-model endpoint)
Cost Optimization¶
Strategies¶
- Right-size instances - Don't over-provision
- Use auto-scaling - Scale down during low traffic (see the sketch after this list)
- Serverless inference - For sporadic workloads
- Multi-model endpoints - Share instance across models
- Spot instances - Useful for training workloads, but not supported for real-time endpoints
- Delete unused endpoints - Stop paying for idle resources
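Auto-scaling for an endpoint variant is configured through Application Auto Scaling; a minimal sketch, where the endpoint name, variant name, capacity limits, and target value are all placeholders:

import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = 'endpoint/my-endpoint/variant/AllTraffic'

# Register the variant's instance count as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance using a target-tracking policy
autoscaling.put_scaling_policy(
    PolicyName='invocations-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 100.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
    },
)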
Monitoring Costs¶
# Check endpoint costs
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-01-31 \
--granularity MONTHLY \
--metrics BlendedCost \
--filter file://filter.json
# filter.json
{
  "Dimensions": {
    "Key": "SERVICE",
    "Values": ["Amazon SageMaker"]
  }
}
Security Best Practices¶
Network Security¶
- Deploy in VPC for private endpoints
- Use VPC endpoints for AWS service access
- Restrict security group ingress rules
- Enable encryption in transit
Data Security¶
- Encrypt model artifacts in S3 (SSE-S3 or SSE-KMS)
- Use IAM roles, not access keys
- Enable CloudTrail for audit logging
- Rotate credentials regularly
Container Security¶
- Use minimal base images
- Scan images for vulnerabilities
- Don't run as root user
- Keep dependencies updated
- Remove unnecessary tools from production images
Regional Considerations¶
Available Regions¶
SageMaker is available in most AWS regions, but:

- Some instance types are region-specific
- Pricing varies by region
- Consider data residency requirements
- Use the same region as your data for lower latency