Monitoring and Logging¶

This guide covers the logging, basic health checks, and monitoring capabilities of the Open Host Factory Plugin.

Overview¶

The Open Host Factory Plugin provides basic monitoring capabilities through:

Application Logging: Detailed operation logs
Health Checks: Basic system health monitoring
Error Tracking: Error detection and logging
Operation Tracking: Request and machine lifecycle logging

Logging¶

Log Configuration¶

Configure logging in your config.json:

{
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "console_enabled": true
  }
}

Log Levels¶

DEBUG: Detailed diagnostic information
INFO: General operational information
WARNING: Warning messages for potential issues
ERROR: Error conditions
CRITICAL: Critical errors that may cause failures

Log Format¶

The application uses structured logging:

2025-06-30 10:00:00,123 INFO [RequestService] Request created successfully request_id=req-123 template_id=template-1 machine_count=3
2025-06-30 10:00:01,456 ERROR [AWSProvider] Failed to provision machine error=InvalidParameterValue request_id=req-123

Log Analysis¶

Common Log Patterns¶

Request Lifecycle:

# Track request from creation to completion
grep "req-123" logs/app.log | grep -E "(created|status|completed)"

Error Analysis:

# Count error types
grep "ERROR" logs/app.log | cut -d']' -f2 | cut -d':' -f1 | sort | uniq -c

# Find recent errors
tail -100 logs/app.log | grep "ERROR"

Performance Analysis:

# Find slow operations
grep "slow" logs/app.log

# Track request duration
grep "Request.*completed" logs/app.log | grep -o "duration=[0-9]*"

Log Rotation¶

For production environments, set up log rotation:

Using logrotate (Linux)¶

Create /etc/logrotate.d/hostfactory:

/path/to/logs/app.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 hostfactory hostfactory
}

Manual Log Management¶

# Archive old logs
mv logs/app.log logs/app.log.$(date +%Y%m%d)
touch logs/app.log

# Compress old logs
gzip logs/app.log.*

# Clean up old logs (keep last 30 days)
find logs/ -name "app.log.*" -mtime +30 -delete

Health Checks¶

Basic Health Check¶

The application provides basic health check functionality:

# Check if the application can start and load configuration
python run.py getAvailableTemplates

# Should return templates or empty list without errors

AWS Connectivity Check¶

# Test AWS credentials and connectivity
aws sts get-caller-identity

# Test EC2 API access
aws ec2 describe-regions --region us-east-1

Configuration Validation¶

# Validate configuration file
python -c "
import json
with open('config/config.json') as f:
    config = json.load(f)
    print('Configuration is valid JSON')
    print(f'Provider: {config.get(\"provider\", {}).get(\"type\", \"unknown\")}')
"

Storage Health Check¶

# Check data directory
ls -la data/

# Check if database file is accessible
if [ -f "data/request_database.json" ]; then
    echo "Database file exists"
    python -c "
import json
with open('data/request_database.json') as f:
    data = json.load(f)
    print(f'Database loaded successfully')
"
else
    echo "Database file not found - will be created on first use"
fi

Error Monitoring¶

Error Types¶

The application logs various types of errors:

Configuration Errors¶

ERROR [ConfigManager] Failed to load configuration: File not found
ERROR [ConfigManager] Invalid JSON in configuration file

AWS Provider Errors¶

ERROR [AWSProvider] AWS API error: InvalidParameterValue
ERROR [AWSProvider] Failed to provision machine: InsufficientInstanceCapacity
ERROR [AWSProvider] Authentication failed: InvalidUserID.NotFound

Application Errors¶

ERROR [RequestService] Template not found: template-123
ERROR [RequestService] Invalid machine count: -1
ERROR [ApplicationService] Failed to create request: ValidationError

Error Tracking Script¶

Create a simple error monitoring script:

#!/bin/bash
# error_monitor.sh

LOG_FILE="logs/app.log"
ERROR_COUNT=$(grep "ERROR" "$LOG_FILE" | wc -l)
RECENT_ERRORS=$(tail -100 "$LOG_FILE" | grep "ERROR" | wc -l)

echo "Total errors: $ERROR_COUNT"
echo "Recent errors (last 100 lines): $RECENT_ERRORS"

if [ "$RECENT_ERRORS" -gt 5 ]; then
    echo "WARNING: High error rate detected"
    echo "Recent errors:"
    tail -100 "$LOG_FILE" | grep "ERROR" | tail -5
fi

Operation Monitoring¶

Request Tracking¶

Monitor request lifecycle:

# Count active requests
python run.py getReturnRequests --active-only | jq '. | length'

# List recent requests
grep "Request.*created" logs/app.log | tail -10

# Track request completion
grep "Request.*completed" logs/app.log | tail -10

Machine Monitoring¶

Track machine provisioning:

# Count machines by status
python run.py getReturnRequests | jq '.[] | .machines[] | .status' | sort | uniq -c

# Monitor provisioning time
grep "Machine.*provisioned" logs/app.log | tail -10

AWS API Monitoring¶

Monitor AWS API usage:

# Count API calls
grep "AWS API" logs/app.log | wc -l

# Check for rate limiting
grep "rate limit" logs/app.log

# Monitor API errors
grep "AWS API.*error" logs/app.log | tail -10

Performance Monitoring¶

Response Time Tracking¶

Monitor command execution time:

# Time command execution
time python run.py getAvailableTemplates

# Monitor slow operations
grep "slow" logs/app.log

Resource Usage¶

Monitor system resources:

# Check memory usage
ps aux | grep python | grep run.py

# Check disk usage
du -sh data/ logs/

# Monitor file handles
lsof | grep python | wc -l

Database Performance¶

For JSON storage:

# Check database file size
ls -lh data/request_database.json

# Monitor database operations
grep "database" logs/app.log | tail -10

Alerting¶

Simple Email Alerts¶

Create a basic alerting script:

#!/bin/bash
# alert_check.sh

LOG_FILE="logs/app.log"
ALERT_EMAIL="admin@example.com"

# Check for critical errors
CRITICAL_ERRORS=$(grep "CRITICAL" "$LOG_FILE" | wc -l)

if [ "$CRITICAL_ERRORS" -gt 0 ]; then
    echo "CRITICAL errors detected in Host Factory Plugin" | \
    mail -s "Host Factory Alert" "$ALERT_EMAIL"
fi

# Check for high error rate
RECENT_ERRORS=$(tail -1000 "$LOG_FILE" | grep "ERROR" | wc -l)

if [ "$RECENT_ERRORS" -gt 50 ]; then
    echo "High error rate detected: $RECENT_ERRORS errors in last 1000 log lines" | \
    mail -s "Host Factory High Error Rate" "$ALERT_EMAIL"
fi

Cron Job Setup¶

# Add to crontab
crontab -e

# Check every 15 minutes
*/15 * * * * /path/to/alert_check.sh

# Daily log summary
0 8 * * * /path/to/daily_summary.sh

Monitoring Scripts¶

Daily Summary Script¶

#!/bin/bash
# daily_summary.sh

LOG_FILE="logs/app.log"
DATE=$(date +%Y-%m-%d)

echo "Host Factory Daily Summary - $DATE"
echo "=================================="

# Request statistics
echo "Requests:"
echo "  Created: $(grep "Request.*created" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Completed: $(grep "Request.*completed" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Failed: $(grep "Request.*failed" "$LOG_FILE" | grep "$DATE" | wc -l)"

# Error statistics
echo "Errors:"
echo "  Total: $(grep "ERROR" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  AWS: $(grep "ERROR.*AWS" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Config: $(grep "ERROR.*Config" "$LOG_FILE" | grep "$DATE" | wc -l)"

# Machine statistics
echo "Machines:"
echo "  Provisioned: $(grep "Machine.*provisioned" "$LOG_FILE" | grep "$DATE" | wc -l)"
echo "  Terminated: $(grep "Machine.*terminated" "$LOG_FILE" | grep "$DATE" | wc -l)"

Health Check Script¶

#!/bin/bash
# health_check.sh

echo "Host Factory Health Check"
echo "========================"

# Test basic functionality
echo -n "Basic functionality: "
if python run.py getAvailableTemplates > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
fi

# Test AWS connectivity
echo -n "AWS connectivity: "
if aws sts get-caller-identity > /dev/null 2>&1; then
    echo "OK"
else
    echo "FAILED"
fi

# Check disk space
echo -n "Disk space: "
DISK_USAGE=$(df . | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -lt 90 ]; then
    echo "OK ($DISK_USAGE%)"
else
    echo "WARNING ($DISK_USAGE%)"
fi

# Check log file size
echo -n "Log file size: "
if [ -f "logs/app.log" ]; then
    LOG_SIZE=$(du -m logs/app.log | cut -f1)
    if [ "$LOG_SIZE" -lt 100 ]; then
        echo "OK (${LOG_SIZE}MB)"
    else
        echo "WARNING (${LOG_SIZE}MB)"
    fi
else
    echo "No log file"
fi

Log Analysis Tools¶

Error Analysis¶

# Top error messages
grep "ERROR" logs/app.log | cut -d']' -f3 | sort | uniq -c | sort -nr | head -10

# Error timeline
grep "ERROR" logs/app.log | cut -d' ' -f1-2 | uniq -c

# AWS-specific errors
grep "ERROR.*AWS" logs/app.log | tail -20

Performance Analysis¶

# Slow operations
grep -E "(slow|timeout|delay)" logs/app.log

# Request duration analysis
grep "duration=" logs/app.log | grep -o "duration=[0-9]*" | sort -n

# API call frequency
grep "AWS API" logs/app.log | cut -d' ' -f1-2 | uniq -c

Troubleshooting Monitoring¶

Common Issues¶

Log File Not Created¶

# Check directory permissions
ls -la logs/

# Create directory if needed
mkdir -p logs
chmod 755 logs

High Log File Size¶

# Check log file size
ls -lh logs/app.log

# Rotate logs manually
mv logs/app.log logs/app.log.old
touch logs/app.log

Missing Health Check Data¶

# Verify configuration
python -c "
import json
with open('config/config.json') as f:
    config = json.load(f)
    print('Logging config:', config.get('logging', {}))
"

Integration with External Monitoring¶

Syslog Integration¶

Configure syslog forwarding:

# In logging configuration
{
  "logging": {
    "level": "INFO",
    "file_path": "logs/app.log",
    "console_enabled": true,
    "syslog_enabled": true,
    "syslog_facility": "local0"
  }
}

Log Forwarding¶

Forward logs to centralized logging:

# Using rsyslog
echo "local0.*    @@logserver:514" >> /etc/rsyslog.conf
systemctl restart rsyslog

# Using filebeat (ELK stack)
# Configure filebeat.yml to monitor logs/app.log

Next Steps¶

Troubleshooting: Learn how to diagnose and fix issues
Configuration: Configure logging and monitoring settings
Deployment: Deploy with monitoring in production
API Reference: Explore command-line interface