Monitoring & Metrics

Real-time monitoring and performance metrics through the FlexGate Admin UI.

Monitoring Dashboard

Access comprehensive monitoring at http://localhost:3000/admin/monitoring

Dashboard Layout

The monitoring dashboard consists of:

Top Section:

  • Time range selector
  • Auto-refresh toggle
  • Export data button
  • Alert configuration

Main Panels:

  • System Overview (top)
  • Traffic Metrics (left)
  • Performance Metrics (right)
  • Route Breakdown (bottom)

Time Range Selection

Control the data time window:

Quick Ranges:

  • Last 5 minutes
  • Last 15 minutes
  • Last hour
  • Last 24 hours
  • Last 7 days
  • Last 30 days

Custom Range:

  1. Click "Custom"
  2. Select start date/time
  3. Select end date/time
  4. Click "Apply"

Auto-Refresh:

  • Toggle on for live data
  • Refresh intervals: 5s, 10s, 30s, 1m, 5m
  • Pause/resume anytime

System Overview

High-level system metrics:

Current Status Cards

Requests per Second (RPS)

┌─────────────────────┐
│  2,456 req/sec      │
│  ▲ +12.5% vs 1h ago │
└─────────────────────┘

Average Response Time

┌─────────────────────┐
│  45ms average       │
│  ▼ -8% vs 1h ago    │
└─────────────────────┘

Error Rate

┌─────────────────────┐
│  0.8% errors        │
│  ▲ +0.3% vs 1h ago  │
└─────────────────────┘

Active Connections

┌─────────────────────┐
│  156 active         │
│  ━ No change        │
└─────────────────────┘
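The trend line on each card compares the current reading with the reading from one hour earlier. The sketch below shows one way such a line could be computed. It assumes the trend is a relative change, and the function name `format_trend` and the "No baseline" fallback are illustrative rather than part of FlexGate.

```python
def format_trend(current: float, previous: float) -> str:
    """Return a status-card style trend line comparing two readings."""
    if previous == 0:
        # A relative change is undefined without a baseline (assumed fallback text).
        return "━ No baseline"
    change = (current - previous) / previous * 100
    if abs(change) < 0.05:
        # Anything that would display as 0.0% is reported as unchanged.
        return "━ No change"
    arrow = "▲" if change > 0 else "▼"
    return f"{arrow} {change:+.1f}% vs 1h ago"


print(format_trend(112.5, 100))  # rising metric
print(format_trend(156, 156))    # unchanged metric
```

For a metric that is itself a percentage, such as the error rate, a dashboard may instead show the difference in percentage points; this sketch does not cover that case.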

System Health Indicators

Health Status:

  • 🟢 Healthy - All systems operational
  • 🟡 Degraded - Some issues detected
  • 🟠 Warning - Serious issues that need attention
  • 🔴 Critical - System down

Component Status:

Component   Status      Details
FlexGate    🟢 Healthy  All instances running
PostgreSQL  🟢 Healthy  Connected, 45ms latency
Redis       🟢 Healthy  Connected, 2ms latency
HAProxy     🟢 Healthy  All backends up
Prometheus  🟢 Healthy  Scraping metrics

Traffic Metrics

Request Volume

Request Rate Chart (Line)

Shows requests per second over time:

req/sec
3000 │                    ╭─╮
2500 │                ╭───╯ ╰╮
2000 │           ╭────╯      ╰─╮
1500 │      ╭────╯             ╰───
1000 │  ╭───╯
 500 │──╯
     └─────────────────────────────
     10:00  11:00  12:00  13:00

Features:

  • Hover for exact values
  • Click to zoom in
  • Drag to select range
  • Multiple series (per route)

Breakdown by:

  • Total requests
  • Successful (2xx)
  • Client errors (4xx)
  • Server errors (5xx)
  • Redirects (3xx)

Status Code Distribution

Pie Chart:

     2xx (94%)
    ╱────────╲
   │          │  4xx (4%)
   │    ●     │──────
   │          │
    ╲────────╱
     5xx (2%)

Status Code Table:

Code              Count   %      Trend
200 OK            45,234  92.1%  ▼ -1.2%
201 Created       1,023   2.1%   ▲ +5.3%
400 Bad Request   987     2.0%   ▲ +0.8%
404 Not Found     756     1.5%   ━ 0%
429 Rate Limited  456     0.9%   ▲ +12%
500 Server Error  234     0.5%   ▼ -2.1%
502 Bad Gateway   123     0.3%   ▲ +1.5%
503 Unavailable   87      0.2%   ▼ -0.5%
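The pie chart groups individual status codes into classes (2xx, 4xx, 5xx). A minimal sketch of that grouping follows, using made-up counts rather than the figures in the table. The function names are illustrative and not a FlexGate API.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"


def class_distribution(counts: dict[int, int]) -> dict[str, float]:
    """Return each status class's share of all requests, in percent."""
    totals: Counter = Counter()
    for code, count in counts.items():
        totals[status_class(code)] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cls: round(100 * n / grand_total, 1) for cls, n in sorted(totals.items())}


# Hypothetical per-code request counts.
sample = {200: 900, 201: 40, 404: 40, 500: 15, 502: 5}
print(class_distribution(sample))
```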

Throughput

Data Transfer:

  • Inbound: 125 MB/s
  • Outbound: 342 MB/s
  • Total: 467 MB/s

Bandwidth Chart:

MB/s
400 │         Outbound ─────────
300 │                   ╱╲  ╱╲
200 │                  ╱  ╲╱  ╰─
100 │ Inbound ─────────────────
  0 └─────────────────────────

Top Routes by Traffic

Bar Chart:

Route            Requests  % of Total
/api/v1/users/*  25,456    42%
/api/v1/posts/*  12,345    20%
/api/v1/search   8,234     14%
/api/v1/auth/*   5,678     9%
/api/v1/media/*  4,567     8%
Other            4,320     7%

Performance Metrics

Response Time

Latency Percentiles:

ms
500 │                         p99
400 │                    ╱────
300 │              ╱────  p95
200 │        ╱─────  p90
100 │  ─────  p50 (median)
  0 └─────────────────────────

Current Values:

Metric        Value    Target   Status
p50 (median)  25ms     <50ms    🟢 Good
p90           78ms     <100ms   🟢 Good
p95           123ms    <200ms   🟢 Good
p99           287ms    <500ms   🟢 Good
Max           1,234ms  <2000ms  🟡 Warning
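A latency percentile such as p95 is the response time that 95% of requests stay at or below. The sketch below uses the nearest-rank method, which is one common definition. How FlexGate itself computes percentiles is not stated here, and monitoring backends often estimate them from histogram buckets instead.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds.
latencies_ms = [12, 18, 25, 31, 40, 47, 78, 95, 123, 287]
for p in (50, 90, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)}ms")
```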

Response Time Distribution (Histogram):

Count
5000 │ ████
4000 │ ████
3000 │ ████ ████
2000 │ ████ ████ ████
1000 │ ████ ████ ████ ████
   0 └─────────────────────
     0-50 50-100 100-200 200+
           Response time (ms)

Request Duration Breakdown

Where time is spent:

Total Request Time: 125ms
├─ DNS Lookup: 5ms (4%)
├─ TCP Connect: 8ms (6%)
├─ TLS Handshake: 12ms (10%)
├─ Request Send: 2ms (2%)
├─ Wait (TTFB): 78ms (62%)  ← Upstream processing
└─ Response Download: 20ms (16%)
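The percentages in the breakdown are each phase's share of the total request time. The sketch below reproduces that arithmetic from the phase timings shown above. The function name is illustrative.

```python
def phase_shares(phases_ms: dict[str, float]) -> dict[str, int]:
    """Return each phase's share of the total request time, in whole percent."""
    total = sum(phases_ms.values())
    if total == 0:
        return {name: 0 for name in phases_ms}
    return {name: round(100 * ms / total) for name, ms in phases_ms.items()}


# Phase timings from the breakdown above (total: 125ms).
phases = {
    "DNS Lookup": 5,
    "TCP Connect": 8,
    "TLS Handshake": 12,
    "Request Send": 2,
    "Wait (TTFB)": 78,
    "Response Download": 20,
}
shares = phase_shares(phases)
slowest = max(phases, key=phases.get)
print(shares)
print(f"Largest contributor: {slowest} ({shares[slowest]}%)")
```

Because each share is rounded independently, the rounded values do not always sum to exactly 100.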

Slow Requests

Recent slow requests:

Time      Method  Path                  Duration  Status
13:45:23  GET     /api/v1/users/search  1,234ms   200
13:44:15  POST    /api/v1/posts         987ms     201
13:42:08  GET     /api/v1/analytics     856ms     200

Click row to see:

  • Full request details
  • Response headers
  • Tracing information
  • Related logs

Upstream Performance

Performance per upstream server:

Upstream: user-service:8080

  • Avg Response: 45ms
  • Error Rate: 0.5%
  • Health: 🟢 Healthy
  • Active Connections: 23

Upstream: post-service:8080

  • Avg Response: 67ms
  • Error Rate: 1.2%
  • Health: 🟢 Healthy
  • Active Connections: 18

Upstream: auth-service:8080

  • Avg Response: 123ms
  • Error Rate: 0.8%
  • Health: 🟡 Degraded
  • Active Connections: 12

Route-Specific Metrics

Select Route

Drill down into specific route metrics:

  1. Click "Select Route" dropdown
  2. Search or browse routes
  3. Select route
  4. View detailed metrics

Route Dashboard

Per-route metrics include:

Traffic Panel:

  • Total requests
  • Request rate
  • Success rate
  • Error breakdown

Performance Panel:

  • Response times (p50, p90, p95, p99)
  • Throughput
  • Connection stats

Health Panel:

  • Upstream health status
  • Circuit breaker state
  • Health check results

Rate Limiting Panel:

  • Rate limit hits
  • Blocked requests
  • Top blocked IPs

Caching Panel:

  • Cache hit rate
  • Cache size
  • Cached responses

Route Comparison

Compare multiple routes:

  1. Click "Compare Routes"
  2. Select up to 5 routes
  3. View side-by-side metrics

Comparison Table:

Metric          Route A  Route B  Route C
RPS             1,234    567      890
Avg Response    45ms     67ms     123ms
Error Rate      0.5%     1.2%     0.8%
Cache Hit Rate  87%      45%      92%

Error Analysis

Track errors over time:

Error %
5.0 │                     ╭─╮
4.0 │                  ╭──╯ ╰╮
3.0 │               ╭──╯     ╰─
2.0 │            ╭──╯
1.0 │      ╭─────╯
0.0 └──────────────────────────

Error Types

Breakdown by category:

Type                 Count  %     Impact
Client Errors (4xx)  1,234  2.5%  🟡 Medium
Server Errors (5xx)  456    0.9%  🔴 High
Timeout Errors       123    0.2%  🔴 High
Circuit Breaker      89     0.2%  🟠 Medium
Rate Limit           567    1.1%  🟢 Low

Error Details Table

Recent errors with full context:

Time   Route       Error         Status  Message
13:45  /api/users  Server Error  500     Internal server error
13:44  /api/posts  Bad Gateway   502     Upstream unreachable
13:43  /api/auth   Timeout       504     Request timeout

Click error for:

  • Full stack trace
  • Request/response details
  • Correlated logs
  • Similar errors

Error Rate Alerts

Configure alerts for error spikes:

Alert Rules:

  1. Error rate > 5% for 5 minutes
  2. 5xx errors > 1% for 2 minutes
  3. Specific route error rate > 10%
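A rule such as "error rate > 5% for 5 minutes" fires only when the condition holds for the whole duration, not on a single spike. The sketch below shows one way to evaluate that against a series of periodic samples. `AlertRule`, `breach_seconds`, and `is_firing` are illustrative names, and FlexGate's actual evaluation logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold_pct: float  # fire when the error rate is above this value...
    duration_s: int       # ...and has stayed above it for at least this long


def breach_seconds(samples: list[tuple[int, float]], threshold_pct: float) -> Optional[int]:
    """Length of the breach streak that ends at the newest sample.

    samples are (unix_seconds, error_rate_pct) pairs in ascending time order.
    Returns None when the newest sample is not above the threshold.
    """
    breach_start = None
    last_ts = None
    for ts, rate in samples:
        if rate > threshold_pct:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # any sample back under the threshold resets the streak
        last_ts = ts
    return None if breach_start is None else last_ts - breach_start


def is_firing(rule: AlertRule, samples: list[tuple[int, float]]) -> bool:
    held = breach_seconds(samples, rule.threshold_pct)
    return held is not None and held >= rule.duration_s


rule = AlertRule(threshold_pct=5.0, duration_s=300)
steady = [(t, 6.2) for t in range(0, 301, 60)]      # above 5% for a full 5 minutes
dipped = steady[:2] + [(120, 4.0)] + steady[3:]     # one dip resets the streak
print(is_firing(rule, steady), is_firing(rule, dipped))
```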

Notification:

  • Email
  • Slack
  • PagerDuty
  • Webhook

Circuit Breaker Status

Circuit Breaker Dashboard

Monitor all circuit breakers:

Circuit States:

Route        State         Failures  Last Trip  Next Check
/api/users   🟢 CLOSED     0         Never      -
/api/posts   🟡 HALF_OPEN  2         10m ago    Testing
/api/legacy  🔴 OPEN       156       5m ago     25m

State Definitions:

  • 🟢 CLOSED - Normal operation
  • 🟡 HALF_OPEN - Testing recovery
  • 🔴 OPEN - Failing fast
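The three states form a small state machine. The sketch below is a simplified illustration that trips on a count of consecutive failures and retries after a fixed timeout. The class, its parameters, and its defaults are assumptions for illustration. FlexGate's own breaker may trip on an error percentage instead, as the event log suggests.

```python
class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN -> CLOSED state machine."""

    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout_s = reset_timeout_s      # how long to fail fast before retrying
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        """Return True if a request may be sent upstream at time `now`."""
        if self.state == "OPEN" and now - self.opened_at >= self.reset_timeout_s:
            self.state = "HALF_OPEN"  # timeout expired: let a trial request through
        return self.state != "OPEN"

    def record_success(self) -> None:
        # A success, including a successful trial in HALF_OPEN, closes the circuit.
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        # A failed trial in HALF_OPEN reopens immediately; otherwise trip at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10)
breaker.record_failure(now=0)
breaker.record_failure(now=1)          # second failure trips the breaker
print(breaker.state)                   # OPEN
print(breaker.allow_request(now=5))    # still failing fast
print(breaker.allow_request(now=11))   # timeout expired, trial allowed
print(breaker.state)                   # HALF_OPEN
```

Passing `now` explicitly keeps the sketch deterministic. A real implementation would typically read a monotonic clock.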

Circuit Breaker Events

Timeline of circuit breaker state changes:

13:45 [api/legacy] CLOSED → OPEN (50% errors, 156 failures)
13:35 [api/posts] OPEN → HALF_OPEN (timeout expired)
13:30 [api/posts] CLOSED → OPEN (timeout errors)

Manual Circuit Breaker Control

Actions:

  • Reset - Force circuit to CLOSED
  • Trip - Force circuit to OPEN
  • Configure - Adjust thresholds

Rate Limiting Metrics

Rate Limit Dashboard

Track rate limiting effectiveness:

Overall Stats:

  • Total rate-limited requests: 2,345
  • Unique IPs rate limited: 45
  • Routes with rate-limited requests: 8

Rate Limit Chart:

Blocked Requests
500 │        ╭───╮
400 │     ╭──╯   ╰─╮
300 │  ╭──╯        ╰──╮
200 │──╯              ╰─
100 │
    └────────────────────

Top Blocked IPs

IPs hitting rate limits:

IP Address     Blocks  Route        Last Blocked
192.168.1.100  234     /api/search  2m ago
10.0.0.45      189     /api/users   5m ago
172.16.0.12    156     /api/posts   8m ago

Actions:

  • View IP details
  • Temporary ban
  • Whitelist
  • Investigate

Rate Limit Effectiveness

Measure protection quality:

Metrics:

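  • Requests allowed (treated as legitimate): 98.5% of total traffic
  • Requests blocked (treated as malicious): 1.5% of total traffic
  • False positives (legitimate requests blocked): 0.2%
  • False negatives (malicious requests allowed): estimated <0.1%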

Cache Performance

Cache Hit Rate

Track caching effectiveness:

Hit Rate %
100 │ ────────────────────
 80 │     ╱────────╲
 60 │    ╱          ╲
 40 │ ───              ────
 20 │
  0 └──────────────────────

Current Stats:

  • Hit Rate: 87%
  • Hits: 45,234
  • Misses: 6,789
  • Cache Size: 245 MB
  • Evictions: 123
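The hit rate is the share of cache lookups served from the cache: hits divided by hits plus misses. A minimal sketch using the figures above follows. The function name is illustrative.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate in percent; 0.0 when there have been no lookups."""
    lookups = hits + misses
    return 0.0 if lookups == 0 else 100 * hits / lookups


# Figures from the stats above: 45,234 hits and 6,789 misses.
print(f"{hit_rate(45_234, 6_789):.0f}%")  # prints "87%"
```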

Cache by Route

Per-route cache performance:

Route        Hit Rate  Hits    Size    TTL
/api/posts   92%       12,345  89 MB   5m
/api/users   85%       23,456  123 MB  10m
/api/search  45%       8,901   33 MB   1m

Cache Operations

Actions:

  • Clear cache (all or per route)
  • Invalidate specific entries
  • Warm cache
  • Adjust TTL

Live Traffic View

Real-time Request Stream

Watch requests as they happen:

🟢 13:45:23.123 GET  /api/users/123    200  45ms  192.168.1.100
🟢 13:45:23.234 POST /api/posts        201  123ms 10.0.0.5
🔴 13:45:23.345 GET  /api/legacy/data  502  5000ms 172.16.0.8
🟡 13:45:23.456 GET  /api/search       429  2ms   192.168.1.100
🟢 13:45:23.567 GET  /api/health       200  1ms   127.0.0.1

Color Coding:

  • 🟢 Success (2xx, 3xx)
  • 🟡 Client Error (4xx)
  • 🔴 Server Error (5xx)

Features:

  • Pause/resume stream
  • Filter by status, route, IP
  • Search in real-time
  • Export visible logs

Request Details

Click any request to see:

Request Tab:

  • Method, path, query
  • Headers
  • Body (if any)
  • Client IP, user agent

Response Tab:

  • Status code
  • Headers
  • Body (formatted)
  • Size

Timing Tab:

  • DNS lookup
  • Connection time
  • TLS handshake
  • Request sent
  • TTFB
  • Download time
  • Total time

Trace Tab:

  • Request ID
  • Parent span
  • Child spans
  • Distributed trace

Custom Dashboards

Create Dashboard

Build custom monitoring views:

  1. Click "New Dashboard"
  2. Name dashboard
  3. Add widgets:
    • Metrics chart
    • Status card
    • Table
    • Gauge
    • Heatmap

Widget Configuration

Chart Widget:

  • Select metric(s)
  • Choose chart type (line, bar, pie)
  • Set time range
  • Configure thresholds

Example Widgets:

  • "API Error Rate by Service"
  • "Top 10 Slowest Endpoints"
  • "Cache Hit Rate Trend"
  • "Rate Limit Blocks by IP"

Share Dashboards

  • Save dashboard
  • Share URL
  • Export dashboard JSON
  • Import dashboard

Alerts & Notifications

Alert Configuration

Set up intelligent alerts:

Alert Types:

  1. Threshold Alert

    • Metric exceeds value
    • Example: Error rate > 5%
  2. Anomaly Detection

    • ML-based anomaly detection
    • Example: Unusual traffic pattern
  3. Composite Alert

    • Multiple conditions
    • Example: High errors AND slow response

Alert Rules

Create Alert Rule:

yaml
name: High Error Rate
condition: error_rate > 5%
duration: 5m
severity: warning
notifications:
  - email: ops@example.com
  # Quote the channel name: an unquoted #alerts is parsed as a YAML comment
  - slack: "#alerts"
throttle: 15m
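The `throttle` setting limits how often a firing alert notifies again. The sketch below assumes it means "at most one notification per alert per throttle interval". That interpretation and the `Throttle` class are assumptions for illustration, not documented FlexGate behavior.

```python
from typing import Optional


class Throttle:
    """Suppress repeat notifications for the same alert within an interval."""

    def __init__(self, interval_s: int):
        self.interval_s = interval_s
        self._last_sent: dict[str, int] = {}  # alert name -> time of last notification

    def should_notify(self, alert_name: str, now: int) -> bool:
        last: Optional[int] = self._last_sent.get(alert_name)
        if last is not None and now - last < self.interval_s:
            return False  # still inside the throttle window
        self._last_sent[alert_name] = now
        return True


throttle = Throttle(interval_s=15 * 60)  # throttle: 15m
print(throttle.should_notify("High Error Rate", now=0))    # first notification goes out
print(throttle.should_notify("High Error Rate", now=600))  # 10 minutes later: suppressed
print(throttle.should_notify("High Error Rate", now=900))  # 15 minutes later: sent again
```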

Alert States:

  • 🟢 OK - Condition not met
  • 🟡 Warning - Threshold approaching
  • 🔴 Critical - Condition met
  • 🔵 Acknowledged - Alert seen

Notification Channels

Configure where alerts go:

Email:

  • Recipients
  • Subject template
  • Body template

Slack:

  • Workspace
  • Channel
  • Mention @user or @channel

Webhook:

  • URL
  • HTTP method
  • Headers
  • Payload template

PagerDuty:

  • Service key
  • Severity mapping
  • Auto-resolve

Alert History

View past alerts:

Time   Alert            Severity     Duration  Status
13:45  High Error Rate  🔴 Critical  5m        Resolved
13:30  Slow Response    🟡 Warning   12m       Resolved
12:15  Circuit Breaker  🔴 Critical  2m        Resolved

Data Export

Export Options

Export monitoring data for analysis:

Format:

  • CSV
  • JSON
  • Parquet
  • Excel

Data:

  • Metrics (time series)
  • Logs
  • Traces
  • Alerts

Time Range:

  • Current view
  • Custom range
  • Last N days

Export Process

  1. Click "Export Data"
  2. Select data type
  3. Choose format
  4. Select time range
  5. Click "Download"

Example CSV:

csv
timestamp,route,method,status,response_time_ms
2026-02-09T13:45:23Z,/api/users,GET,200,45
2026-02-09T13:45:24Z,/api/posts,POST,201,123
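An exported CSV can be analyzed with standard tooling. The sketch below reads the columns shown in the example and computes the mean response time per route. It assumes the export always includes the `route` and `response_time_ms` columns, and the function name is illustrative.

```python
import csv
import io
from collections import defaultdict


def mean_response_time_by_route(csv_text: str) -> dict[str, float]:
    """Average response_time_ms per route from exported CSV text."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        sums[row["route"]] += float(row["response_time_ms"])
        counts[row["route"]] += 1
    return {route: sums[route] / counts[route] for route in sums}


export = """timestamp,route,method,status,response_time_ms
2026-02-09T13:45:23Z,/api/users,GET,200,45
2026-02-09T13:45:24Z,/api/posts,POST,201,123
"""
print(mean_response_time_by_route(export))
```

To process a downloaded file, read its contents and pass the text to the same function.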

Best Practices

Monitoring Strategy

  1. Set Baselines

    • Establish normal ranges
    • Monitor for deviations
    • Adjust thresholds
  2. Alert Fatigue

    • Avoid too many alerts
    • Use severity levels
    • Aggregate related alerts
    • Set throttling
  3. Dashboard Organization

    • Overview dashboard (high-level)
    • Service dashboards (per team)
    • Incident dashboard (on-call)
  4. Data Retention

    • Real-time: 24 hours
    • High resolution: 7 days
    • Aggregated: 90 days
    • Long-term: 1 year (sampled)

Performance Monitoring

Monitor these key metrics:

Golden Signals:

  1. Latency (response time)
  2. Traffic (request rate)
  3. Errors (error rate)
  4. Saturation (resource usage)

SLIs (Service Level Indicators) with example targets:

  • Availability: 99.9%
  • Latency p99: <500ms
  • Error rate: <1%
  • Throughput: >10k req/sec
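An availability target translates directly into an error budget: the amount of downtime the target still permits. The sketch below works that out for a 30-day window. The window length and the function name are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Downtime permitted by an availability target over the given window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_pct) / 100


# A 99.9% target over 30 days leaves roughly 43.2 minutes of downtime.
print(f"{allowed_downtime_minutes(99.9):.1f} minutes")
```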

Keyboard Shortcuts

Shortcut  Action
R         Refresh data
P         Pause auto-refresh
T         Change time range
E         Export data
F         Toggle fullscreen
L         Open live traffic
S         Search/filter

Need help? Join our GitHub Discussions.

Released under the MIT License.