Observability
Philosophy
"If you can't measure it, you can't improve it. If you can't debug it in 5 minutes, your logging is broken."
Three Pillars
1. Logs (What happened?)
2. Metrics (How much happened?)
3. Traces (Where did time go?)
Logs
Structure
All logs are JSON for machine parsing:
{
"timestamp": "2026-01-26T10:30:45.123Z",
"level": "info",
"correlationId": "req-abc123-xyz789",
"service": "flexgate-proxy",
"event": "request.completed",
"http": {
"method": "GET",
"path": "/api/users/123",
"statusCode": 200,
"clientIp": "192.168.1.100",
"userAgent": "Mozilla/5.0..."
},
"upstream": {
"host": "api.backend.com",
"duration": 145,
"statusCode": 200
},
"duration": 152,
"metadata": {
"userId": "user-456",
"route": "api-users"
}
}Log Levels
| Level | Use Case | Example |
|---|---|---|
error | Request failed, upstream down, config invalid | "Upstream timeout after 30s" |
warn | Rate limit hit, slow response, auth retry | "Request exceeded rate limit" |
info | Every request/response (sampled in prod) | "Request completed" |
debug | Headers, body (dev only) | "Request headers: {...}" |
Sampling Strategy
Production: Log 100% of errors, 10% of successful requests (configurable)
logging:
level: "info"
sampling:
enabled: true
successRate: 0.1 # 10% of 2xx responses
errorRate: 1.0 # 100% of errorsWhy: At 5K req/sec, logging everything = 10GB/hour. Sampling preserves costs while keeping error visibility.
Correlation IDs
Every request gets a unique ID:
X-Correlation-ID: req-abc123-xyz789Propagated to:
- Upstream requests (forwarded header)
- All log entries for that request
- Error responses
- Metrics tags
Debugging flow:
- User reports error
- Get correlation ID from response
grep correlationId=req-abc123across all logs- See full request path
What We Log
✅ Always Log
- Request start/end
- Authentication attempts
- Rate limit hits
- Upstream failures
- Config reloads
- Circuit breaker state changes
⚠️ Never Log (Security)
- API keys / tokens
- Authorization headers
- Password fields
- Credit card numbers
- PII (email, phone unless masked)
🔧 Debug Only
- Request/response bodies
- Full headers
- DNS resolution details
Log Rotation
logging:
file: "/var/log/proxy/access.log"
maxSize: "100MB"
maxFiles: 10
compress: trueMetrics
Key Metrics
Request Metrics
# Request rate
http_requests_total{method="GET", route="/api/users", status="200"}
# Latency histogram
http_request_duration_ms{route="/api/users", quantile="0.95"}
# Request size
http_request_size_bytes{route="/api/users"}Upstream Metrics
# Upstream response time
upstream_duration_ms{upstream="api.backend.com", quantile="0.99"}
# Upstream errors
upstream_errors_total{upstream="api.backend.com", error="timeout"}
# Connection pool
upstream_connections_active{upstream="api.backend.com"}System Metrics
# Memory usage
process_memory_bytes{type="heapUsed"}
# Event loop lag
event_loop_lag_ms{quantile="0.99"}
# Active connections
tcp_connections_activeBusiness Metrics
# Rate limit hits
rate_limit_exceeded_total{route="/api/users", client="192.168.1.100"}
# Auth failures
auth_failures_total{reason="invalid_key"}
# Circuit breaker trips
circuit_breaker_opened_total{upstream="api.backend.com"}Percentiles We Track
- P50: Typical user experience
- P95: 1 in 20 requests (bad but tolerable)
- P99: 1 in 100 requests (red flag)
- P99.9: Worst-case outliers
Why we care about P99:
- P50/P95 can look great while P99 is terrible
- Users remember the slow requests, not the fast ones
Metrics Exposition
Prometheus format at /metrics:
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/users",status="200"} 12543
# HELP http_request_duration_ms Request duration in milliseconds
# TYPE http_request_duration_ms histogram
http_request_duration_ms_bucket{route="/api/users",le="10"} 8234
http_request_duration_ms_bucket{route="/api/users",le="50"} 12000
http_request_duration_ms_bucket{route="/api/users",le="100"} 12500
http_request_duration_ms_sum{route="/api/users"} 567890
http_request_duration_ms_count{route="/api/users"} 12543Health Checks
Liveness Probe
Endpoint: GET /health/live
Purpose: Is the process running?
{
"status": "UP",
"timestamp": "2026-01-26T10:30:45.123Z"
}Returns:
200if server is alive503if shutting down
Use: Kubernetes liveness probe (restart if failing)
Readiness Probe
Endpoint: GET /health/ready
Purpose: Can it handle traffic?
{
"status": "UP",
"checks": {
"config": "UP",
"upstreams": "UP",
"rateLimiter": "UP"
},
"timestamp": "2026-01-26T10:30:45.123Z"
}Returns:
200if ready to serve traffic503if not ready (still starting, upstreams down, etc.)
Use: Kubernetes readiness probe (remove from load balancer if failing)
Deep Health Check
Endpoint: GET /health/deep (internal only)
Purpose: Detailed diagnostics
{
"status": "DEGRADED",
"uptime": 86400,
"checks": {
"config": {
"status": "UP",
"lastReload": "2026-01-26T08:00:00Z"
},
"upstreams": {
"status": "DEGRADED",
"details": {
"api.backend.com": {
"status": "UP",
"latency": 45
},
"legacy.backend.com": {
"status": "DOWN",
"error": "Connection timeout"
}
}
},
"memory": {
"status": "UP",
"heapUsed": 156782592,
"heapTotal": 268435456,
"percentUsed": 58.4
},
"eventLoop": {
"status": "UP",
"lag": 12
}
}
}Alerts
What We Alert On
🚨 Critical (Page On-Call)
- name: "Proxy Down"
condition: up == 0
duration: 1m
severity: critical
- name: "High Error Rate"
condition: rate(http_requests_total{status=~"5.."}[5m]) > 0.05 # >5% errors
duration: 2m
severity: critical
- name: "Upstream Completely Down"
condition: upstream_errors_total{error="connection_refused"} > 100
duration: 1m
severity: critical⚠️ Warning (Investigate During Business Hours)
- name: "High Latency"
condition: http_request_duration_ms{quantile="0.99"} > 1000 # P99 > 1s
duration: 5m
severity: warning
- name: "Memory Usage High"
condition: process_memory_bytes / node_memory_total > 0.8 # >80% memory
duration: 5m
severity: warning
- name: "Rate Limit Hit Frequently"
condition: rate(rate_limit_exceeded_total[5m]) > 10
duration: 5m
severity: warning📊 Info (Track, Don't Alert)
- Circuit breaker opened/closed
- Config reloaded
- Upstream degraded (but not down)
- Event loop lag spikes
What We DON'T Alert On
❌ Individual Request Failures
Why: Normal noise. Alert on error rate, not individual errors.
❌ Low Traffic
Why: Might be expected (off-hours, maintenance window).
❌ Slow Upstream
Why: We track it, but alert on circuit breaker instead.
Dashboards
Overview Dashboard
- RPS: Requests per second (by route, status)
- Latency: P50, P95, P99 (proxy + upstream)
- Error Rate: % of 5xx responses
- Active Connections: Current TCP connections
Upstream Health Dashboard
- Upstream Latency: P95/P99 per upstream
- Error Rate: By upstream and error type
- Circuit Breaker: State per upstream
- Connection Pool: Active/idle connections
Security Dashboard
- Auth Failures: By client IP
- Rate Limit Hits: By route and client
- Blocked Requests: SSRF, large payloads, etc.
Debugging Playbook
Symptom: High P99 Latency
- Check upstream latency metrics → Is it the backend?
- Check event loop lag → Node.js blocking?
- Check GC pauses → Memory pressure?
- Check request size → Large payloads?
- Enable debug logging for slow requests
Symptom: 5xx Errors
- Get correlation ID from logs
- Check upstream status → Down? Timeout?
- Check circuit breaker state → Open?
- Check auth failures → Invalid keys?
- Review error logs for stack traces
Symptom: Memory Leak
- Check heap size over time
- Enable heap dump on threshold
- Analyze with Chrome DevTools
- Look for connection leaks (unclosed sockets)
Log Aggregation
Recommended Stack
- Fluentd/Vector: Log shipping
- Elasticsearch: Storage + search
- Kibana: Visualization
- Grafana: Metrics + logs in one place
Query Examples
Find all errors for a user:
level:error AND metadata.userId:"user-456"Find slow requests:
duration:>1000 AND event:"request.completed"Find rate limit hits:
event:"rate_limit.exceeded" AND http.path:"/api/users"Cost Optimization
Log Storage Costs
At 5K req/sec with 10% sampling:
- 500 requests/sec logged
- ~1KB per log entry
- = 500 KB/sec = 1.3 TB/month
Mitigations:
- Sampling (already doing)
- Retention policy (7 days hot, 30 days cold, 90 days archived)
- Compress old logs (gzip → 10:1 compression)
Metrics Costs
With 50 unique metrics × 10 labels each:
- ~500 time series
- 15s scrape interval
- = 2.88M samples/day
Mitigations:
- Limit cardinality (no user ID in labels!)
- Aggregate low-value metrics (e.g., per-route → per-service)
- Retention: 15 days full resolution, 90 days downsampled
Future Enhancements
- Distributed Tracing: OpenTelemetry integration
- Real User Monitoring: Client-side latency tracking
- SLO Tracking: Error budget dashboards
- Anomaly Detection: ML-based alerting
- Cost Attribution: Track cost per route/client