AWS Best Practices for Production Deployment
This guide outlines AWS best practices for deploying the GenAI IDP Accelerator to production environments. These recommendations help ensure security, reliability, cost optimization, and operational excellence.
Production Readiness
This solution is in experimental stage. Implement all applicable best practices and conduct thorough testing before production deployment.
Compute & Performance
Lambda Functions
Concurrent Execution Limits
- Check: Ensure that AWS Lambda function is configured for function-level concurrent execution limit
- Current Status: 43 Lambda functions without configured limits
- Risk Level: Medium
- Recommendation: Configure reserved concurrency for each Lambda function to prevent runaway costs
Why This Matters:
- Protects against unexpected spikes in invocations that could exhaust your AWS account's Lambda concurrency limit (default 1,000 concurrent executions per region)
- Prevents runaway costs from infinite loops or DDoS attacks
- Ensures critical functions have guaranteed capacity
- Protects downstream services from being overwhelmed
Recommended Limits by Function Type:
- High-volume processing functions (e.g., document classification): 50-100
- API handlers: 20-50
- Background jobs: 10-20
- Administrative functions: 5-10
Dead Letter Queue (DLQ)
- Check: Ensure that AWS Lambda function is configured for a Dead Letter Queue
- Current Status: 37 Lambda functions without configured DLQ
- Risk Level: Medium
- Recommendation: Configure DLQ for Lambda functions to capture and analyze failed invocations
Why This Matters:
- Captures failed asynchronous invocations for debugging and retry logic
- Prevents silent failures and data loss
- Enables monitoring and alerting on processing failures
- Provides visibility into error patterns and system issues
When to Use DLQ:
- Asynchronous Lambda invocations (S3 events, SNS, EventBridge)
- Critical processing workflows where failures must be tracked
- Functions with retry logic that may eventually fail
- Not needed for synchronous invocations (API Gateway, direct invokes)
VPC Configuration
- Check: Ensure that AWS Lambda function is configured inside a VPC
- Current Status: 10 Lambda functions without VPC configuration
- Risk Level: Low to Medium
- Recommendation: Configure Lambda functions to run inside a VPC when accessing private resources
Why This Matters:
- Required for accessing resources in private subnets (RDS, ElastiCache, internal APIs)
- Provides network-level isolation and security controls
- Enables use of security groups and network ACLs
- Required for compliance with certain security standards
When to Use VPC:
- Functions accessing RDS databases or other VPC-only resources
- Functions requiring private network connectivity
- Compliance requirements mandate network isolation
- Not needed for functions only accessing public AWS services (S3, DynamoDB, Bedrock)
Trade-offs:
- Adds cold start latency (mitigated with Hyperplane ENIs in newer Lambda runtime)
- Requires NAT Gateway for internet access (additional cost)
- More complex networking configuration
Lambda Environment Variable Encryption
- Check: Check encryption settings for Lambda environmental variables
- Current Status: 39 Lambda functions without environment variable encryption
- Risk Level: High
- Recommendation: Configure Lambda functions to encrypt environment variables using KMS customer-managed keys
Why This Matters:
- Environment variables often contain sensitive data (API keys, database credentials, secrets)
- Default encryption uses AWS-managed keys with limited control and auditability
- Customer-managed keys provide detailed access control and audit trails
- Prevents unauthorized access to sensitive configuration data
Best Practices:
- Use AWS Secrets Manager or Parameter Store for highly sensitive data instead of environment variables
- Rotate encryption keys regularly
- Apply least-privilege access to KMS keys
- Monitor key usage through CloudTrail
Lambda Code Signing
- Check: Ensure AWS Lambda function is configured to validate code-signing
- Current Status: 46 Lambda functions without code-signing validation
- Risk Level: Medium
- Recommendation: Configure Lambda functions to validate code signatures to ensure code integrity
Why This Matters:
- Ensures only trusted code is deployed to Lambda functions
- Prevents deployment of unauthorized or tampered code
- Provides audit trail of who signed and deployed code
- Required for compliance with certain security frameworks
When to Use Code Signing:
- Production environments with strict security requirements
- Regulated industries (finance, healthcare, government)
- Multi-team environments where code provenance is critical
- Organizations with formal change management processes
Trade-offs:
- Adds complexity to deployment pipeline
- Requires managing signing profiles and certificates
- May slow down deployment process
(Additional compute recommendations will be added here)
Security & Encryption
Data Encryption
CodeBuild Encryption with CMK
- Check: Ensure that CodeBuild projects are encrypted using Customer Managed Keys (CMK)
- Current Status: 3 CodeBuild projects without CMK encryption
- Risk Level: Medium
- Recommendation: Configure CodeBuild projects to use customer-managed KMS keys for encryption
Why This Matters:
- Provides greater control over encryption keys and access policies
- Enables detailed audit trails through CloudTrail for key usage
- Supports compliance requirements for customer-managed encryption
- Allows key rotation policies aligned with organizational security standards
What Gets Encrypted:
- Build artifacts stored in S3
- Build environment variables containing sensitive data
- Cache data used during builds
- Build logs in CloudWatch
CloudWatch Log Group Encryption
- Check: Ensure that CloudWatch Log Group is encrypted by KMS
- Current Status: 7 CloudWatch Log Groups without KMS encryption
- Risk Level: Medium
- Recommendation: Configure CloudWatch Log Groups to use KMS encryption for log data at rest
Why This Matters:
- Protects sensitive information in application logs (API keys, user data, system details)
- Meets compliance requirements for data encryption at rest
- Provides audit trail of who accessed log data
- Enables fine-grained access control through KMS key policies
Affected Log Groups:
- Lambda function logs
- Step Functions execution logs
- API Gateway access logs
- Application and system logs
(Additional security recommendations will be added here)
Monitoring & Observability
CloudWatch Integration
Alarms and Alerts
- Recommendation: Set up CloudWatch alarms for critical metrics and error rates
- Implementation: (To be added)
- Rationale: Enables proactive monitoring and rapid incident response
CloudWatch Log Retention
- Check: Ensure CloudWatch log groups retain logs for at least 1 year
- Current Status: 37 CloudWatch Log Groups without adequate retention
- Risk Level: Low to Medium
- Recommendation: Configure CloudWatch Log Groups with appropriate retention periods based on compliance and operational needs
Why This Matters:
- Required for compliance with data retention regulations (SOC 2, HIPAA, PCI-DSS)
- Enables historical analysis and troubleshooting of past incidents
- Supports security investigations and audit requirements
- Prevents indefinite log storage costs
Recommended Retention Periods:
- Production logs: 1 year minimum (365 days)
- Security and audit logs: 2-7 years depending on compliance requirements
- Development/test logs: 30-90 days
- Debug logs: 7-30 days
Trade-offs:
- Longer retention increases storage costs
- Balance compliance requirements with cost optimization
- Consider archiving to S3 for long-term retention at lower cost
(Additional monitoring recommendations will be added here)
Cost Optimization
Resource Management
Tagging Strategy
- Recommendation: Implement comprehensive tagging strategy for cost allocation
- Implementation: (To be added)
- Rationale: Enables accurate cost tracking and optimization opportunities
(Additional cost optimization recommendations will be added here)
Reliability & Resilience
High Availability
Multi-AZ Deployment
- Recommendation: Deploy critical components across multiple availability zones
- Implementation: (To be added)
- Rationale: Ensures service availability during AZ-level failures
CloudFront Distribution
Origin Failover Configuration
- Check: Ensure CloudFront distributions have origin failover configured
- Current Status: 1 CloudFront distribution without origin failover
- Risk Level: Medium
- Recommendation: Configure CloudFront with origin groups for automatic failover to secondary origins
Why This Matters:
- Ensures high availability of web UI and content delivery
- Automatically routes traffic to backup origin if primary fails
- Reduces downtime and improves user experience
- Provides resilience against origin failures
Use Cases:
- Primary S3 bucket with failover to replica in another region
- Primary origin with backup origin for redundancy
- Multi-region disaster recovery scenarios
Geo Restriction
- Check: Ensure AWS CloudFront web distribution has geo restriction enabled
- Current Status: 1 CloudFront distribution without geo restriction configured
- Risk Level: Low
- Recommendation: Configure geo restrictions based on your application's geographic requirements and compliance needs
Why This Matters:
- Helps comply with data residency and export control regulations
- Reduces exposure to attacks from specific geographic regions
- Controls content distribution based on licensing agreements
- Can reduce costs by limiting traffic to specific regions
When to Use:
- Compliance requirements restrict access to certain countries
- Content licensing limited to specific geographic regions
- Security policy requires blocking high-risk regions
- Not needed if application serves global audience without restrictions
(Additional reliability recommendations will be added here)
Implementation Checklist
Use this checklist to track your production readiness:
- [ ] Lambda concurrent execution limits configured
- [ ] Customer-managed KMS keys implemented
- [ ] CloudWatch alarms configured
- [ ] Comprehensive resource tagging applied
- [ ] Multi-AZ deployment verified
- [ ] Security review completed
- [ ] Load testing performed
- [ ] Disaster recovery plan documented
- [ ] Monitoring dashboards created
- [ ] Cost alerts configured
Next Steps
- Review each recommendation in detail
- Prioritize based on your organization's requirements
- Implement recommendations incrementally
- Test thoroughly in non-production environments
- Document any deviations from best practices