Monitoring and Observability¶
Overview¶
This guide covers health checks, metrics collection, logging, alerting, dashboards, and distributed tracing for the Open Host Factory Plugin in production environments.
Health Checks¶
Basic Health Check¶
# Check application health
curl http://localhost:8000/health

# Expected response
{
  "status": "healthy",
  "service": "open-hostfactory-plugin",
  "version": "1.0.0",
  "timestamp": "2025-01-07T10:00:00Z"
}
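For scripted checks (cron jobs, wrapper probes), the same endpoint can be polled from Python. A minimal stdlib-only sketch, assuming the response shape shown above:

# Poll the health endpoint and exit non-zero when unhealthy
import json
import sys
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:8000/health", timeout=5) as resp:
        body = json.load(resp)
    healthy = body.get("status") == "healthy"
except OSError:  # covers connection refused, timeouts, and HTTP errors
    healthy = False

sys.exit(0 if healthy else 1)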
Kubernetes Health Checks¶
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
Metrics Collection¶
Prometheus Integration¶
# prometheus.yml
scrape_configs:
  - job_name: 'ohfp-api'
    static_configs:
      - targets: ['ohfp-api:8000']
    metrics_path: /metrics
    scrape_interval: 30s
Custom Metrics¶
The application exposes metrics for the following (see the sketch after this list):
- Request count and duration
- Authentication success/failure rates
- AWS API call metrics
- Error rates by endpoint
- Resource provisioning metrics
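A hypothetical sketch of how such metrics might be declared with prometheus_client; http_requests_total and http_request_duration_seconds match the metric names queried later in this guide, while the other names are illustrative rather than the plugin's actual ones:

from prometheus_client import Counter, Histogram

# Request count and duration
REQUEST_COUNT = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"],
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request duration in seconds",
    ["method", "endpoint"],
)

# Authentication and AWS call outcomes (names are hypothetical)
AUTH_FAILURES = Counter("auth_failures_total", "Failed authentication attempts")
AWS_API_CALLS = Counter(
    "aws_api_calls_total", "AWS API calls",
    ["service", "operation", "outcome"],
)

# Example: record one request
REQUEST_COUNT.labels(method="GET", endpoint="/health", status="200").inc()
REQUEST_DURATION.labels(method="GET", endpoint="/health").observe(0.012)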
Logging¶
Structured Logging¶
Enable JSON-formatted output in the application configuration:

{
  "logging": {
    "level": "INFO",
    "format": "json",
    "file_enabled": true,
    "file_path": "/app/logs/app.log",
    "console_enabled": true
  }
}
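A minimal sketch of a matching JSON formatter using only the standard library (the field names are illustrative; the file path comes from the configuration above):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per log record
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.FileHandler("/app/logs/app.log")  # path from config above
handler.setFormatter(JsonFormatter())
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)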
Log Aggregation¶
ELK Stack¶
# Filebeat configuration
filebeat.inputs:
  - type: log
    paths:
      - /app/logs/*.log
    fields:
      service: ohfp-api
    fields_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
Fluentd¶
<source>
  @type tail
  path /app/logs/app.log
  pos_file /var/log/fluentd/ohfp.log.pos
  tag ohfp.api
  format json
</source>

<match ohfp.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name ohfp-logs
</match>
Alerting¶
Prometheus Alerts¶
groups:
  - name: ohfp-api
    rules:
      - alert: OHFPAPIDown
        expr: up{job="ohfp-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OHFP API is down"

      - alert: OHFPHighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
AWS CloudWatch¶
Custom metrics can also be published as CloudWatch metric data:

{
  "MetricName": "APIErrors",
  "Namespace": "OHFP/API",
  "Dimensions": [
    {
      "Name": "Service",
      "Value": "ohfp-api"
    }
  ],
  "Value": 1,
  "Unit": "Count"
}
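A sketch of publishing that datum with boto3 (assumes AWS credentials and a region are configured in the environment):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one APIErrors datapoint to the OHFP/API namespace
cloudwatch.put_metric_data(
    Namespace="OHFP/API",
    MetricData=[
        {
            "MetricName": "APIErrors",
            "Dimensions": [{"Name": "Service", "Value": "ohfp-api"}],
            "Value": 1,
            "Unit": "Count",
        }
    ],
)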
Dashboards¶
Grafana Dashboard¶
Key metrics to monitor:
- Request rate and response time
- Error rates by endpoint
- Authentication success/failure
- AWS API call latency
- Resource provisioning success rate
- Container resource usage
Sample Queries¶
# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

# Response time percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Distributed Tracing¶
OpenTelemetry¶
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Install the SDK tracer provider
trace.set_tracer_provider(TracerProvider())

# Export spans in batches to a Jaeger agent over Thrift/UDP
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)
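Once configured, spans can wrap interesting operations; the span and attribute names below are illustrative, not the plugin's actual ones:

# Trace a single operation (names are hypothetical)
with tracer.start_as_current_span("provision_machines") as span:
    span.set_attribute("ohfp.template_id", "template-1")
    # ... provisioning logic ...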
Performance Monitoring¶
Application Performance¶
Monitor the following (a collection sketch follows the list):

- Memory usage and garbage collection
- CPU utilization
- Database connection pool usage
- AWS API rate limiting
- Cache hit rates
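A hypothetical sketch of collecting a few of these in-process, assuming prometheus_client and psutil are installed:

import gc
import psutil
from prometheus_client import Gauge

PROCESS = psutil.Process()

MEMORY_RSS = Gauge("ohfp_memory_rss_bytes", "Resident memory of the process")
CPU_PERCENT = Gauge("ohfp_cpu_percent", "Process CPU utilization")
GC_COLLECTIONS = Gauge(
    "ohfp_gc_collections", "GC collections per generation", ["generation"],
)

def collect_performance_metrics() -> None:
    # Call periodically, e.g. from a background thread
    MEMORY_RSS.set(PROCESS.memory_info().rss)
    CPU_PERCENT.set(PROCESS.cpu_percent(interval=None))
    for gen, stats in enumerate(gc.get_stats()):
        GC_COLLECTIONS.labels(generation=str(gen)).set(stats["collections"])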
Infrastructure Monitoring¶
Monitor:

- Container resource usage
- Network latency
- Disk I/O
- Load balancer metrics
- Auto-scaling events
For the complete monitoring setup, see the deployment guide.