Troubleshooting¶
Common issues and solutions when using ML Container Creator.
Quick Reference¶
| Issue | Fix |
|---|---|
| `SyntaxError: Unexpected token 'export'` | `nvm use node` (requires Node.js 24.11.1+) |
| Generator not found | `npm link` in the project directory |
| Docker build fails: package not found | Pin versions in `requirements.txt` |
| Container exits immediately | `docker logs <id>` to check for errors |
| `/ping` returns connection refused | Verify container is running and port 8080 is exposed |
| ECR authentication failed | `aws ecr get-login-password` (see ECR Auth) |
| Endpoint stuck in `Creating` | Check CloudWatch logs for health check or model loading failures |
| Model file not found in container | Verify `COPY` directive in Dockerfile targets `/opt/ml/model/` |
| HuggingFace API timeout | Use `--offline` flag |
| HuggingFace access denied | Verify token and model license agreement |
Generator Issues¶
Generator Not Found¶
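If the CLI reports the generator as missing, re-linking it usually resolves the problem. A minimal sketch, assuming you have a local clone of the project:

```shell
# Re-register the generator globally from the project directory
cd ml-container-creator
npm install
npm link
```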
Node.js Version Error¶
MCC requires Node.js 24.11.1+ for ES module support:
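A sketch of checking and switching versions with nvm (assuming nvm is installed):

```shell
node --version       # must report v24.11.1 or newer
nvm install 24.11.1  # install a compatible version if needed
nvm use node         # switch to the latest installed Node.js
```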
CLI Not Found in CI¶
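In CI, `npm link` from an earlier job does not carry over to a fresh runner, so install and link in the same job that invokes the CLI. A minimal sketch (step names and the `--help` check are illustrative):

```shell
# Install dependencies and link the CLI in the same CI job
npm ci
npm link
ml-container-creator --help  # verify the CLI is on PATH
```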
Docker Build Issues¶
Package Not Found¶
Pin versions in requirements.txt:
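An example of pinned versions (the package names and versions below are illustrative; pin whatever your model actually uses):

```
scikit-learn==1.4.2
numpy==1.26.4
joblib==1.3.2
```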
Or rebuild without cache: docker build --no-cache -t my-model .
Permission Denied¶
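On Linux, a "permission denied" error from `docker build` usually means your user cannot access the Docker daemon socket. A common fix, assuming a standard Docker installation:

```shell
# Add the current user to the docker group, then refresh group membership
sudo usermod -aG docker "$USER"
newgrp docker   # or log out and back in
docker build -t my-model .
```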
Local Testing Issues¶
Container Exits Immediately¶
```shell
docker logs <container-id>         # Check for startup errors
docker run -it my-model /bin/bash  # Debug interactively
```
Common causes: missing model file at /opt/ml/model/, missing Python dependencies, syntax errors in serve.py.
Health Check Fails¶
If the container is running but not responding, the server may have failed to bind to port 8080. Check that the Dockerfile exposes port 8080 and the server is configured to listen on 0.0.0.0:8080.
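To check the health endpoint from the host (assuming the container's port 8080 is published to the same port on localhost):

```shell
# A healthy container returns HTTP 200 from /ping
curl -i http://localhost:8080/ping
```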
Inference Returns Error¶
Common causes:
- Wrong input format -- ensure the `Content-Type: application/json` header is set
- Feature count mismatch -- input must have the same number of features as training data
- Wrong data types -- ensure numeric values are floats, not strings
AWS Deployment Issues¶
ECR Authentication Failed¶
```shell
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  <account-id>.dkr.ecr.us-east-1.amazonaws.com
```
Verify your credentials are valid: `aws sts get-caller-identity`
IAM Permission Denied¶
Required permissions for deployment: ecr:CreateRepository, ecr:PutImage, ecr:InitiateLayerUpload, ecr:UploadLayerPart, ecr:CompleteLayerUpload, sagemaker:CreateModel, sagemaker:CreateEndpointConfig, sagemaker:CreateEndpoint, iam:PassRole. See the generated IAM_PERMISSIONS.md for the full policy document.
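A minimal policy sketch covering the actions listed above; resource scoping is omitted here for brevity, and the generated IAM_PERMISSIONS.md remains the authoritative source:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:CreateRepository",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
```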
Endpoint Creation Failed¶
Common causes:
- Invalid IAM role -- verify with `aws iam get-role --role-name <role>`
- Image not found in ECR -- verify with `aws ecr describe-images --repository-name <repo>`
- Insufficient instance capacity -- try a different instance type or region
Endpoint Stuck in Creating¶
If the endpoint stays in `Creating` status for more than 15 minutes, check CloudWatch logs. Common causes:
- Container fails to start -- check the Dockerfile `CMD`/`ENTRYPOINT` and verify port 8080 is exposed
- Health check fails -- `/ping` must return 200 within 2 seconds
- Model loading fails -- verify the model file exists in the container and its format matches the handler code
To recover: delete the endpoint and redeploy.
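A sketch of the recovery commands (endpoint and config names are placeholders; deleting the endpoint config is optional but avoids name collisions on redeploy):

```shell
# Delete the stuck endpoint and its config, then redeploy
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>
aws sagemaker delete-endpoint-config --endpoint-config-name <config-name>
```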
Endpoint Returns 500¶
Common causes: exception in prediction code, wrong input format, model not loaded. Test locally first:
```shell
./do/run
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0]]}'
```
Model Loading Issues¶
Model File Not Found¶
```shell
# Verify the model is in the container
docker run my-model ls -la /opt/ml/model/

# Check the COPY directive in the Dockerfile
grep "COPY.*model" Dockerfile
```
For predictive models, the model file must be copied into the container at build time. For transformer models, the serving framework downloads the model at runtime from HuggingFace Hub.
Model Format Mismatch¶
Verify the model file format matches what you selected during generation:
| Framework | Expected formats |
|---|---|
| sklearn | .pkl, .joblib |
| xgboost | .json, .model, .ubj |
| tensorflow | SavedModel/ directory, .keras, .h5 |
Pickle Version Mismatch¶
The Python version used to save the model must match the version in the container. Either re-save the model with a compatible protocol (pickle.dump(model, f, protocol=4)) or update the Python version in the Dockerfile.
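A sketch of re-saving with an older pickle protocol; the dict below stands in for a trained model object, and protocol 4 is readable by Python 3.4 and later:

```python
import pickle

model = {"coef": [0.5, 1.2]}  # stand-in for a trained model object

# Re-save with protocol 4 so older Python versions in the container can read it
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=4)

# Verify the file loads back cleanly
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
```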
HuggingFace Issues¶
API Timeout¶
This is expected behavior -- the generator falls back to local registry data. To skip HuggingFace API calls entirely:
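Using the CLI name shown under Getting Help:

```shell
ml-container-creator --offline
```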
Access Denied or Repository Not Found¶
For private or gated models:
- Verify the model ID at `https://huggingface.co/<model-id>`
- Accept the model's license agreement on HuggingFace (for gated models like Llama)
- Provide a valid token: `--hf-token="$HF_TOKEN"` (use double quotes so the shell expands the variable)
See HuggingFace Authentication for details.
Rate Limit Exceeded¶
Use --offline to skip API calls, or set HF_TOKEN for higher rate limits:
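A sketch of both options, using the `--hf-token` flag shown above (the token value is a placeholder):

```shell
# Option 1: skip HuggingFace API calls entirely
ml-container-creator --offline

# Option 2: authenticate for higher rate limits
export HF_TOKEN=<your-token>
ml-container-creator --hf-token="$HF_TOKEN"
```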
Getting Help¶
```shell
# Container logs
docker logs <container-id>

# SageMaker endpoint logs
aws logs tail /aws/sagemaker/Endpoints/<endpoint-name> --follow

# Generator debug output
DEBUG=* ml-container-creator
```
- GitHub Issues -- report bugs
- GitHub Discussions -- ask questions
- SageMaker Documentation -- AWS reference