Monitoring and Logging¶
This guide covers the logging, health-check, and monitoring capabilities of the Open Resource Broker.
Overview¶
The Open Resource Broker provides basic monitoring capabilities through:
- Application Logging: Detailed operation logs
- Health Checks: Basic system health monitoring
- Error Tracking: Error detection and logging
- Operation Tracking: Request and machine lifecycle logging
Logging¶
Log Configuration¶
Configure logging in your config.json:
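A minimal `logging` block might look like this; the field names follow the syslog-integration example later in this guide, so verify them against your installed version's configuration schema:

```json
{
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "console_enabled": true
  }
}
```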
Log Levels¶
- DEBUG: Detailed diagnostic information
- INFO: General operational information
- WARNING: Warning messages for potential issues
- ERROR: Error conditions
- CRITICAL: Severe errors that may cause the application to fail
Log Format¶
The application uses structured logging:
2025-06-30 10:00:00,123 INFO [RequestService] Request created successfully request_id=req-123 template_id=template-1 machine_count=3
2025-06-30 10:00:01,456 ERROR [AWSProvider] Failed to provision machine error=InvalidParameterValue request_id=req-123
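The plugin's actual logging setup is not shown here; as an illustration only, Python's standard logging module can produce key=value lines in this shape (the logger name and fields are taken from the sample lines above):

```python
import logging

# Formatter matching the structured lines above:
# "2025-06-30 10:00:00,123 INFO [RequestService] Request created ... request_id=req-123"
formatter = logging.Formatter("%(asctime)s %(levelname)s [%(name)s] %(message)s")

handler = logging.StreamHandler()
handler.setFormatter(formatter)

log = logging.getLogger("RequestService")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Context goes into the message as key=value pairs so the lines stay grep-able
log.info("Request created successfully request_id=%s template_id=%s",
         "req-123", "template-1")
```

Keeping context as plain key=value text (rather than JSON) is what makes the grep/cut recipes in the next section work.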
Log Analysis¶
Common Log Patterns¶
Request Lifecycle:
# Track request from creation to completion
grep "req-123" logs/app.log | grep -E "(created|status|completed)"
Error Analysis:
# Count error types
grep "ERROR" logs/app.log | cut -d']' -f2 | cut -d':' -f1 | sort | uniq -c
# Find recent errors
tail -100 logs/app.log | grep "ERROR"
Performance Analysis:
# Find slow operations
grep "slow" logs/app.log
# Track request duration
grep "Request.*completed" logs/app.log | grep -o "duration=[0-9]*"
Log Rotation¶
For production environments, set up log rotation:
Using logrotate (Linux)¶
Create /etc/logrotate.d/hostfactory:
/path/to/logs/app.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 hostfactory hostfactory
}
Manual Log Management¶
# Archive old logs
mv logs/app.log logs/app.log.$(date +%Y%m%d)
touch logs/app.log
# Compress old logs
gzip logs/app.log.*
# Clean up old logs (keep last 30 days)
find logs/ -name "app.log.*" -mtime +30 -delete
Health Checks¶
Basic Health Check¶
The application provides basic health check functionality:
# Check if the application can start and load configuration
python run.py getAvailableTemplates
# Should return templates or empty list without errors
AWS Connectivity Check¶
# Test AWS credentials and connectivity
aws sts get-caller-identity
# Test EC2 API access
aws ec2 describe-regions --region us-east-1
Configuration Validation¶
# Validate configuration file
python -c "
import json
with open('config/config.json') as f:
    config = json.load(f)
print('Configuration is valid JSON')
print(f'Provider: {config.get(\"provider\", {}).get(\"type\", \"unknown\")}')
"
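The one-liner above can be extended into a reusable function that also verifies the keys this guide's examples rely on. The `REQUIRED` list below is an assumption; adjust it to your real schema:

```python
import json

# Section/key pairs the examples in this guide rely on --
# adjust REQUIRED to match your actual config schema.
REQUIRED = [("provider", "type"), ("logging", "level")]

def validate_config(path):
    """Return a list of problems found in the config file (empty list = OK)."""
    try:
        with open(path) as f:
            config = json.load(f)
    except (OSError, ValueError) as exc:  # ValueError covers JSONDecodeError
        return [f"cannot load {path}: {exc}"]
    return [
        f"missing {section}.{key}"
        for section, key in REQUIRED
        if key not in config.get(section, {})
    ]

# Usage: print(validate_config("config/config.json") or "OK")
```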
Storage Health Check¶
# Check data directory
ls -la data/
# Check if database file is accessible
if [ -f "data/request_database.json" ]; then
    echo "Database file exists"
    python -c "
import json
with open('data/request_database.json') as f:
    data = json.load(f)
print('Database loaded successfully')
"
else
    echo "Database file not found - will be created on first use"
fi
Error Monitoring¶
Error Types¶
The application logs various types of errors:
Configuration Errors¶
ERROR [ConfigManager] Failed to load configuration: File not found
ERROR [ConfigManager] Invalid JSON in configuration file
AWS Provider Errors¶
ERROR [AWSProvider] AWS API error: InvalidParameterValue
ERROR [AWSProvider] Failed to provision machine: InsufficientInstanceCapacity
ERROR [AWSProvider] Authentication failed: InvalidUserID.NotFound
Application Errors¶
ERROR [RequestService] Template not found: template-123
ERROR [RequestService] Invalid machine count: -1
ERROR [ApplicationService] Failed to create request: ValidationError
Error Tracking Script¶
Create a simple error monitoring script:
#!/bin/bash
# error_monitor.sh
LOG_FILE="logs/app.log"
ERROR_COUNT=$(grep "ERROR" "$LOG_FILE" | wc -l)
RECENT_ERRORS=$(tail -100 "$LOG_FILE" | grep "ERROR" | wc -l)
echo "Total errors: $ERROR_COUNT"
echo "Recent errors (last 100 lines): $RECENT_ERRORS"
if [ "$RECENT_ERRORS" -gt 5 ]; then
    echo "WARNING: High error rate detected"
    echo "Recent errors:"
    tail -100 "$LOG_FILE" | grep "ERROR" | tail -5
fi
Operation Monitoring¶
Request Tracking¶
Monitor request lifecycle:
# Count active requests
python run.py getReturnRequests --active-only | jq '. | length'
# List recent requests
grep "Request.*created" logs/app.log | tail -10
# Track request completion
grep "Request.*completed" logs/app.log | tail -10
Machine Monitoring¶
Track machine provisioning:
# Count machines by status
python run.py getReturnRequests | jq '.[] | .machines[] | .status' | sort | uniq -c
# Monitor provisioning time
grep "Machine.*provisioned" logs/app.log | tail -10
AWS API Monitoring¶
Monitor AWS API usage:
# Count API calls
grep "AWS API" logs/app.log | wc -l
# Check for rate limiting
grep "rate limit" logs/app.log
# Monitor API errors
grep "AWS API.*error" logs/app.log | tail -10
Performance Monitoring¶
Response Time Tracking¶
Monitor command execution time:
# Time command execution
time python run.py getAvailableTemplates
# Monitor slow operations
grep "slow" logs/app.log
Resource Usage¶
Monitor system resources:
# Check memory usage
ps aux | grep python | grep run.py
# Check disk usage
du -sh data/ logs/
# Monitor file handles
lsof | grep python | wc -l
Database Performance¶
For JSON storage:
# Check database file size
ls -lh data/request_database.json
# Monitor database operations
grep "database" logs/app.log | tail -10
Alerting¶
Simple Email Alerts¶
Create a basic alerting script:
#!/bin/bash
# alert_check.sh
LOG_FILE="logs/app.log"
ALERT_EMAIL="admin@example.com"
# Check for critical errors (last 1000 lines, so old entries don't re-alert on every run)
CRITICAL_ERRORS=$(tail -1000 "$LOG_FILE" | grep "CRITICAL" | wc -l)
if [ "$CRITICAL_ERRORS" -gt 0 ]; then
    echo "CRITICAL errors detected in Host Factory Plugin" | \
        mail -s "Host Factory Alert" "$ALERT_EMAIL"
fi
# Check for high error rate
RECENT_ERRORS=$(tail -1000 "$LOG_FILE" | grep "ERROR" | wc -l)
if [ "$RECENT_ERRORS" -gt 50 ]; then
    echo "High error rate detected: $RECENT_ERRORS errors in last 1000 log lines" | \
        mail -s "Host Factory High Error Rate" "$ALERT_EMAIL"
fi
Cron Job Setup¶
# Add to crontab
crontab -e
# Check every 15 minutes
*/15 * * * * /path/to/alert_check.sh
# Daily log summary
0 8 * * * /path/to/daily_summary.sh
Monitoring Scripts¶
Daily Summary Script¶
#!/bin/bash
# daily_summary.sh
LOG_FILE="logs/app.log"
DATE=$(date +%Y-%m-%d)
echo "Host Factory Daily Summary - $DATE"
echo "=================================="
# Request statistics
echo "Requests:"
echo " Created: $(grep "Request.*created" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo " Completed: $(grep "Request.*completed" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo " Failed: $(grep "Request.*failed" "$LOG_FILE" | grep "$DATE" | wc -l)"
# Error statistics
echo "Errors:"
echo " Total: $(grep "ERROR" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo " AWS: $(grep "ERROR.*AWS" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo " Config: $(grep "ERROR.*Config" "$LOG_FILE" | grep "$DATE" | wc -l)"
# Machine statistics
echo "Machines:"
echo " Provisioned: $(grep "Machine.*provisioned" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo " Terminated: $(grep "Machine.*terminated" "$LOG_FILE" | grep "$DATE" | wc -l)"
Health Check Script¶
#!/bin/bash
# health_check.sh
echo "Host Factory Health Check"
echo "========================"
# Test basic functionality
echo -n "Basic functionality: "
if python run.py getAvailableTemplates > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
fi
# Test AWS connectivity
echo -n "AWS connectivity: "
if aws sts get-caller-identity > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
fi
# Check disk space
echo -n "Disk space: "
DISK_USAGE=$(df . | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 90 ]; then
    echo "OK ($DISK_USAGE%)"
else
    echo "WARNING ($DISK_USAGE%)"
fi
# Check log file size
echo -n "Log file size: "
if [ -f "logs/app.log" ]; then
    LOG_SIZE=$(du -m logs/app.log | cut -f1)
    if [ "$LOG_SIZE" -lt 100 ]; then
        echo "OK (${LOG_SIZE}MB)"
    else
        echo "WARNING (${LOG_SIZE}MB)"
    fi
else
    echo "No log file"
fi
Log Analysis Tools¶
Error Analysis¶
# Top error messages
grep "ERROR" logs/app.log | cut -d']' -f2 | sort | uniq -c | sort -nr | head -10
# Error timeline (errors per hour)
grep "ERROR" logs/app.log | cut -d' ' -f1-2 | cut -d: -f1 | sort | uniq -c
# AWS-specific errors
grep "ERROR.*AWS" logs/app.log | tail -20
Performance Analysis¶
# Slow operations
grep -E "(slow|timeout|delay)" logs/app.log
# Request duration analysis
grep "duration=" logs/app.log | grep -o "duration=[0-9]*" | sort -n
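A sorted list of raw durations is easier to digest as summary statistics. Assuming completion lines carry a `duration=<number>` field as above, awk can compute the count, average, and maximum; the sample lines piped in here are hypothetical:

```shell
# Hypothetical completion lines; on a real system, replace the printf
# with:  grep "duration=" logs/app.log
printf '%s\n' \
  'Request req-123 completed duration=1200' \
  'Request req-124 completed duration=800' \
| grep -o "duration=[0-9]*" \
| cut -d= -f2 \
| sort -n \
| awk '{ v[NR] = $1; sum += $1 }
       END { if (NR) printf "n=%d avg=%d max=%d\n", NR, sum / NR, v[NR] }'
```

Because the input is sorted numerically, `v[NR]` is the maximum; the same array also makes percentiles easy (e.g. `v[int(NR * 0.95)]`).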
# API call frequency (calls per hour)
grep "AWS API" logs/app.log | cut -d' ' -f1-2 | cut -d: -f1 | sort | uniq -c
Troubleshooting Monitoring¶
Common Issues¶
Log File Not Created¶
# Check directory permissions
ls -la logs/
# Create directory if needed
mkdir -p logs
chmod 755 logs
High Log File Size¶
# Check log file size
ls -lh logs/app.log
# Rotate logs manually
mv logs/app.log logs/app.log.old
touch logs/app.log
Missing Health Check Data¶
# Verify configuration
python -c "
import json
with open('config/config.json') as f:
    config = json.load(f)
print('Logging config:', config.get('logging', {}))
"
Integration with External Monitoring¶
Syslog Integration¶
Configure syslog forwarding:
# In logging configuration
{
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "console_enabled": true,
    "syslog_enabled": true,
    "syslog_facility": "local0"
  }
}
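How `syslog_enabled` and `syslog_facility` map to an actual handler is implementation-specific; as a sketch only, this is roughly what the wiring could look like with Python's standard `SysLogHandler` (the helper name, facility table, and address default are assumptions):

```python
import logging
from logging.handlers import SysLogHandler

# Map the "syslog_facility" option string to the stdlib facility constant
FACILITIES = {
    "local0": SysLogHandler.LOG_LOCAL0,
    "local1": SysLogHandler.LOG_LOCAL1,
}

def add_syslog_handler(logger, facility_name="local0", address=("localhost", 514)):
    """Attach a UDP syslog handler, mirroring the syslog_* options above."""
    handler = SysLogHandler(address=address, facility=FACILITIES[facility_name])
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return handler
```

With `address=("localhost", 514)` records are sent as UDP datagrams, which matches the rsyslog forwarding setup in the next section.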
Log Forwarding¶
Forward logs to centralized logging:
# Using rsyslog
echo "local0.* @@logserver:514" >> /etc/rsyslog.conf
systemctl restart rsyslog
# Using filebeat (ELK stack)
# Configure filebeat.yml to monitor logs/app.log
Next Steps¶
- Troubleshooting: Learn how to diagnose and fix issues
- Configuration: Configure logging and monitoring settings
- Deployment: Deploy with monitoring in production
- API Reference: Explore command-line interface