AI-Assisted Incident Response Playbook
Overview
Transform your incident response from manual dashboard-watching to AI-powered automatic diagnosis. This playbook walks you through setting up FlexGate's AI incident response system that detects, analyzes, and recommends fixes for production incidents in under 2 minutes.
What You'll Build:
- Real-time error spike detection
- Automatic Claude analysis of failures
- Slack notifications with root cause + fix recommendations
- Recovery verification
Prerequisites:
- FlexGate v2.0+ running
- Anthropic API key (create one at https://console.anthropic.com)
- Slack workspace (optional but recommended)
- Node.js 18+
Step 1: Install Dependencies (3 minutes)
# Install FlexGate AI module
npm install @flexgate/ai @anthropic-ai/sdk
# Install Slack SDK (optional)
npm install @slack/bolt
# Install testing utilities
npm install --save-dev @flexgate/ai-testing
Verify installation:
npm list @flexgate/ai
# Should show: @flexgate/ai@2.0.0
Step 2: Configure Environment Variables (2 minutes)
Add to your .env file:
# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-api03-xxxxx
# Optional: Slack integration
SLACK_BOT_TOKEN=xoxb-xxxxx
SLACK_SIGNING_SECRET=xxxxx
SLACK_INCIDENT_CHANNEL=#incidents
# AI Event Configuration
AI_EVENTS_ENABLED=true
AI_MIN_CONFIDENCE=0.75
AI_MAX_COST_PER_DAY=5.00
Get your Anthropic API key:
- Visit https://console.anthropic.com
- Click "Get API keys"
- Create new key, copy it
- Add it to .env
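A typo in one of these variables is easy to miss until the first incident. The sketch below shows one way to validate them at startup. It assumes only the variable names and defaults listed above; the helper name parseAiEnv and the AiEnvConfig type are hypothetical and not part of @flexgate/ai.

```typescript
// Hypothetical startup check for the AI-related environment variables.
// Not part of @flexgate/ai; names are illustrative only.
interface AiEnvConfig {
  enabled: boolean;
  minConfidence: number; // expected range: 0..1
  maxCostPerDay: number; // USD
}

function parseAiEnv(env: Record<string, string | undefined>): AiEnvConfig {
  const enabled = env.AI_EVENTS_ENABLED === 'true';

  // The API key only matters when AI events are switched on.
  if (enabled && !env.ANTHROPIC_API_KEY) {
    throw new Error('ANTHROPIC_API_KEY is required when AI_EVENTS_ENABLED=true');
  }

  const minConfidence = Number(env.AI_MIN_CONFIDENCE ?? '0.75');
  if (!Number.isFinite(minConfidence) || minConfidence < 0 || minConfidence > 1) {
    throw new Error(`AI_MIN_CONFIDENCE must be between 0 and 1, got "${env.AI_MIN_CONFIDENCE}"`);
  }

  const maxCostPerDay = Number(env.AI_MAX_COST_PER_DAY ?? '5.00');
  if (!Number.isFinite(maxCostPerDay) || maxCostPerDay <= 0) {
    throw new Error(`AI_MAX_COST_PER_DAY must be a positive number, got "${env.AI_MAX_COST_PER_DAY}"`);
  }

  return { enabled, minConfidence, maxCostPerDay };
}
```

Calling parseAiEnv(process.env) once during application startup would surface a missing key or a malformed number immediately rather than at the first analysis.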
Step 3: Create AI Event Configuration (5 minutes)
Create config/ai-events.ts:
import { AIEventEmitter, AIEventType, EventSeverity } from '@flexgate/ai';
// Initialize AI event emitter
export const aiEvents = new AIEventEmitter({
enabled: process.env.AI_EVENTS_ENABLED === 'true',
// Event detection thresholds
events: {
[AIEventType.ERROR_RATE_SPIKE]: {
enabled: true,
threshold: 10, // Trigger at 10% error rate
window: '5m', // Monitor 5-minute window
minSamples: 5, // Need at least 5 data points
severity: EventSeverity.CRITICAL
},
[AIEventType.LATENCY_ANOMALY]: {
enabled: true,
threshold: 1000, // Trigger at 1000ms latency
window: '5m',
minSamples: 5,
severity: EventSeverity.WARNING
},
[AIEventType.CIRCUIT_BREAKER_CANDIDATE]: {
enabled: true,
threshold: 50, // 50% error rate
window: '2m',
minSamples: 3,
severity: EventSeverity.CRITICAL
}
},
// Cost controls
budget: {
maxCostPerDay: parseFloat(process.env.AI_MAX_COST_PER_DAY || '5.00'),
maxAnalysesPerHour: 20,
alertOnBudgetExceeded: true
},
// Confidence filtering
minConfidence: parseFloat(process.env.AI_MIN_CONFIDENCE || '0.75')
});
export default aiEvents;
Step 4: Integrate with FlexGate Middleware (5 minutes)
Create middleware/ai-monitoring.ts:
import { Request, Response, NextFunction } from 'express';
import aiEvents from '../config/ai-events';
// Track metrics for AI analysis
interface RequestMetrics {
route: string;
method: string;
statusCode: number;
duration: number;
timestamp: string;
error?: Error;
}
// Store recent metrics (in-memory for demo, use Redis in production)
const metricsBuffer: RequestMetrics[] = [];
const MAX_BUFFER_SIZE = 1000;
export const aiMonitoring = (req: Request, res: Response, next: NextFunction) => {
const startTime = Date.now();
// Capture original res.end
const originalEnd = res.end;
// Override res.end to capture metrics
res.end = function(chunk?: any, encoding?: any, callback?: any) {
const duration = Date.now() - startTime;
const isError = res.statusCode >= 400;
// Record metrics
const metrics: RequestMetrics = {
route: req.route?.path || req.path,
method: req.method,
statusCode: res.statusCode,
duration,
timestamp: new Date().toISOString(),
error: isError ? res.locals.error : undefined
};
// Add to buffer
metricsBuffer.push(metrics);
if (metricsBuffer.length > MAX_BUFFER_SIZE) {
metricsBuffer.shift(); // Remove oldest
}
// Emit to AI event system
aiEvents.trackRequest(metrics).catch(err => {
console.error('AI tracking error:', err);
});
// Call original end
return originalEnd.call(this, chunk, encoding, callback);
};
next();
};
// Analyze metrics periodically
setInterval(async () => {
if (metricsBuffer.length < 10) return; // Need sufficient data
// Calculate error rate
const recentMetrics = metricsBuffer.slice(-50); // Last 50 requests
const errorCount = recentMetrics.filter(m => m.statusCode >= 400).length;
const errorRate = (errorCount / recentMetrics.length) * 100;
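// Check for anomalies; catch failures so a rejected promise inside
// setInterval does not surface as an unhandled rejection
try {
await aiEvents.analyze({
metric: 'error_rate',
value: errorRate,
samples: recentMetrics.map(m => ({
timestamp: m.timestamp,
value: m.statusCode >= 400 ? 1 : 0,
labels: {
route: m.route,
method: m.method
}
}))
});
} catch (err) {
console.error('AI analysis error:', err);
}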
}, 30000); // Every 30 seconds
export { metricsBuffer };
Add to your Express app:
import express from 'express';
import { aiMonitoring } from './middleware/ai-monitoring';
const app = express();
// Add AI monitoring middleware
app.use(aiMonitoring);
// ... rest of your routes
Step 5: Configure Claude Integration (5 minutes)
Create services/claude-analyzer.ts:
import Anthropic from '@anthropic-ai/sdk';
import { PromptTemplateLibrary } from '@flexgate/ai';
import aiEvents from '../config/ai-events';
import type { AIEvent } from '@flexgate/ai';
// Initialize Claude client
const claude = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY!
});
// Analyze event with Claude
export async function analyzeWithClaude(event: AIEvent) {
try {
// Build prompt from template
const prompt = PromptTemplateLibrary.buildPrompt(event);
// Get recommended model and token limit
const model = PromptTemplateLibrary.getRecommendedModel(event);
const maxTokens = PromptTemplateLibrary.getMaxTokens(event);
console.log(`[Claude] Analyzing ${event.type} (confidence: ${event.confidence})`);
// Call Claude API
const response = await claude.messages.create({
model,
max_tokens: maxTokens,
messages: [{
role: 'user',
content: prompt
}]
});
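// Parse JSON response (content blocks are a union type, so narrow to the text block first)
const textBlock = response.content.find(block => block.type === 'text');
if (!textBlock || textBlock.type !== 'text') {
throw new Error('[Claude] Response contained no text block');
}
const analysis = JSON.parse(textBlock.text);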
console.log(`[Claude] Analysis complete (confidence: ${analysis.confidence}%)`);
return {
event,
analysis,
cost: PromptTemplateLibrary.estimateCost(event),
timestamp: new Date().toISOString()
};
} catch (error) {
console.error('[Claude] Analysis failed:', error);
throw error;
}
}
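// Auto-analyze critical events
// (remove these two handlers if you register the Step 6 handlers, which also
// call analyzeWithClaude; otherwise each event is analyzed twice)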
aiEvents.on('ERROR_RATE_SPIKE', async (event) => {
if (event.severity === 'CRITICAL') {
const result = await analyzeWithClaude(event);
console.log('[Incident] Error spike detected and analyzed:', result.analysis);
}
});
aiEvents.on('CIRCUIT_BREAKER_CANDIDATE', async (event) => {
const result = await analyzeWithClaude(event);
console.log('[Incident] Circuit breaker candidate:', result.analysis);
});
export default { analyzeWithClaude };
Step 6: Set Up Slack Notifications (5 minutes)
Create services/slack-notifier.ts:
import { App } from '@slack/bolt';
import type { AIEvent } from '@flexgate/ai';
// Initialize Slack app
const slack = new App({
token: process.env.SLACK_BOT_TOKEN,
signingSecret: process.env.SLACK_SIGNING_SECRET
});
const INCIDENT_CHANNEL = process.env.SLACK_INCIDENT_CHANNEL || '#incidents';
export async function notifyIncident(event: AIEvent, analysis: any) {
if (!process.env.SLACK_BOT_TOKEN) {
console.log('[Slack] Not configured, skipping notification');
return;
}
try {
// Format severity emoji
const severityEmoji = {
INFO: 'ℹ️',
WARNING: '⚠️',
CRITICAL: '🚨'
}[event.severity] || '❓';
// Build Slack message
await slack.client.chat.postMessage({
channel: INCIDENT_CHANNEL,
blocks: [
{
type: 'header',
text: {
type: 'plain_text',
text: `${severityEmoji} ${event.type.replace(/_/g, ' ')}`,
emoji: true
}
},
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*Summary:* ${event.summary}\n*Confidence:* ${Math.round(event.confidence * 100)}%\n*Detected:* ${new Date(event.timestamp).toLocaleString()}`
}
},
{
type: 'divider'
},
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*🤖 AI Analysis (Claude):*\n\n${formatAnalysis(analysis)}`
}
},
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*📊 Metrics:*\n• Current: ${event.data.current_value}\n• Threshold: ${event.data.threshold}\n• Breach: ${event.data.breach_ratio}x\n• Trend: ${event.data.trend}`
}
},
{
type: 'actions',
elements: [
{
type: 'button',
text: { type: 'plain_text', text: '🔍 View Details' },
style: 'primary',
url: `https://your-flexgate-dashboard.com/events/${event.id}`
},
{
type: 'button',
text: { type: 'plain_text', text: '✅ Acknowledge' },
action_id: 'ack_incident',
value: event.id
}
]
}
]
});
console.log(`[Slack] Incident notification sent to ${INCIDENT_CHANNEL}`);
} catch (error) {
console.error('[Slack] Notification failed:', error);
}
}
function formatAnalysis(analysis: any): string {
let text = '';
// Root causes
if (analysis.root_causes || analysis.likely_causes) {
const causes = analysis.root_causes || analysis.likely_causes;
text += '*Root Causes:*\n';
if (Array.isArray(causes)) {
causes.slice(0, 3).forEach((cause: any) => {
const prob = typeof cause === 'object' ? ` (${cause.probability}%)` : '';
const desc = typeof cause === 'object' ? cause.cause : cause;
text += `• ${desc}${prob}\n`;
});
}
}
// Recommended actions
if (analysis.actions) {
text += '\n*Recommended Actions:*\n';
analysis.actions.slice(0, 3).forEach((action: string) => {
text += `• ${action}\n`;
});
}
// Rollback decision
if (analysis.rollback_needed) {
text += `\n*⚠️ Rollback Recommended:* ${analysis.rollback_reasoning}`;
}
// Impact
if (analysis.user_impact || analysis.impact_if_ignored) {
text += `\n\n*Impact:* ${analysis.user_impact || analysis.impact_if_ignored}`;
}
return text;
}
// Handle acknowledgments
slack.action('ack_incident', async ({ ack, body }) => {
await ack();
await slack.client.chat.postMessage({
channel: INCIDENT_CHANNEL,
thread_ts: (body as any).message.ts,
text: `✅ Incident acknowledged by <@${body.user.id}>`
});
});
export default slack;
Connect to AI events:
import aiEvents from './config/ai-events';
import { analyzeWithClaude } from './services/claude-analyzer';
import { notifyIncident } from './services/slack-notifier';
// Auto-analyze and notify on critical events
aiEvents.on('ERROR_RATE_SPIKE', async (event) => {
const result = await analyzeWithClaude(event);
await notifyIncident(event, result.analysis);
});
aiEvents.on('CIRCUIT_BREAKER_CANDIDATE', async (event) => {
const result = await analyzeWithClaude(event);
await notifyIncident(event, result.analysis);
});
Step 7: Testing (5 minutes)
Create tests/ai-incident-response.test.ts:
import { AIEventFactory, AIEventType, EventSeverity, TrendDirection } from '@flexgate/ai';
import { analyzeWithClaude } from '../services/claude-analyzer';
describe('AI Incident Response', () => {
it('should detect error rate spike', async () => {
// Create simulated error spike event
const event = AIEventFactory.create({
type: AIEventType.ERROR_RATE_SPIKE,
summary: 'Error rate spiked to 35% on POST /api/payments',
severity: EventSeverity.CRITICAL,
data: {
metric: 'error_rate',
current_value: 35,
threshold: 10,
window: '5m',
trend: TrendDirection.RISING,
unit: '%'
},
context: {
route: '/api/payments',
method: 'POST',
recent_samples: [
{ timestamp: '2026-02-15T14:25:00Z', value: 2.5 },
{ timestamp: '2026-02-15T14:26:00Z', value: 5.0 },
{ timestamp: '2026-02-15T14:27:00Z', value: 15.0 },
{ timestamp: '2026-02-15T14:28:00Z', value: 28.0 },
{ timestamp: '2026-02-15T14:29:00Z', value: 35.0 },
]
}
});
expect(event.confidence).toBeGreaterThan(0.8);
expect(event.type).toBe(AIEventType.ERROR_RATE_SPIKE);
});
it('should analyze with Claude', async () => {
const event = AIEventFactory.create({
type: AIEventType.ERROR_RATE_SPIKE,
summary: 'Test error spike',
severity: EventSeverity.WARNING,
data: {
metric: 'error_rate',
current_value: 15,
threshold: 5,
window: '5m',
trend: TrendDirection.RISING,
unit: '%'
}
});
const result = await analyzeWithClaude(event);
expect(result.analysis).toBeDefined();
expect(result.analysis.confidence).toBeGreaterThan(0);
expect(result.analysis.actions).toBeDefined();
expect(result.cost).toBeLessThan(0.02); // Should be cheap
}, 15000); // 15 second timeout for API call
});
Run tests:
npm test -- tests/ai-incident-response.test.ts
# Expected output:
# ✓ should detect error rate spike (12ms)
# ✓ should analyze with Claude (2341ms)
# Tests: 2 passed, 2 total
Verification Checklist
Run through this checklist to ensure everything works:
✅ Configuration
- [ ] Anthropic API key in .env
- [ ] AI events enabled (AI_EVENTS_ENABLED=true)
- [ ] Slack configured (optional)
- [ ] Environment variables loaded
Test:
node -e "console.log(process.env.ANTHROPIC_API_KEY ? 'API key found' : 'Missing API key')"
✅ Middleware Integration
- [ ] aiMonitoring middleware added to Express
- [ ] Metrics being collected
- [ ] Buffer size reasonable (at most 1000 entries, the MAX_BUFFER_SIZE cap)
Test:
curl http://localhost:3000/health
# Check logs for: "[AI] Request tracked"
✅ Event Detection
- [ ] Error spike detection working
- [ ] Latency anomaly detection working
- [ ] Confidence scores >0.75
Test:
# Simulate error spike
for i in {1..20}; do
curl -X POST http://localhost:3000/api/test -w "\n" || true
done
# Check logs for: "[AI] Event detected: ERROR_RATE_SPIKE"
✅ Claude Integration
- [ ] Claude API calls succeeding
- [ ] JSON responses parsing correctly
- [ ] Cost under budget (<$5/day)
Test:
npm test -- tests/ai-incident-response.test.ts
✅ Slack Notifications
- [ ] Notifications arriving in #incidents channel
- [ ] Formatting looks good
- [ ] Buttons work (acknowledge, view details)
Test:
# Trigger manual notification
node -e "
const { notifyIncident } = require('./services/slack-notifier');
const { AIEventFactory } = require('@flexgate/ai');
const event = AIEventFactory.createSample('ERROR_RATE_SPIKE');
notifyIncident(event, { actions: ['Test action'] });
"
Production Deployment
Environment Setup
Staging:
AI_EVENTS_ENABLED=true
AI_MIN_CONFIDENCE=0.70 # Lower threshold for testing
AI_MAX_COST_PER_DAY=2.00
SLACK_INCIDENT_CHANNEL=#staging-incidents
Production:
AI_EVENTS_ENABLED=true
AI_MIN_CONFIDENCE=0.80 # Higher confidence required
AI_MAX_COST_PER_DAY=10.00
SLACK_INCIDENT_CHANNEL=#incidents
Gradual Rollout
Week 1: Shadow Mode
- AI analysis runs but doesn't notify
- Review analysis quality
- Tune confidence thresholds
Week 2: Notify Only
- Send Slack notifications
- No auto-actions yet
- Build team trust in AI recommendations
Week 3: Production
- Full AI-assisted incident response
- Monitor MTTR improvements
- Collect feedback
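One way to implement the three phases is a single switch that decides what happens to each finished analysis. The sketch below is an assumption, not a FlexGate feature: the mode names, the AI_ROLLOUT_MODE variable mentioned in the note, and both function names are hypothetical.

```typescript
// Hypothetical rollout switch; not part of @flexgate/ai.
type RolloutMode = 'shadow' | 'notify' | 'full';
type Routing = 'log' | 'notify' | 'drop';

// Unknown or missing values fall back to shadow mode, the quietest option.
function parseRolloutMode(value: string | undefined): RolloutMode {
  if (value === 'shadow' || value === 'notify' || value === 'full') return value;
  return 'shadow';
}

// Decide what to do with a finished analysis.
function routeAnalysis(mode: RolloutMode, confidence: number, minConfidence: number): Routing {
  // Shadow mode keeps every analysis for review so thresholds can be tuned.
  if (mode === 'shadow') return 'log';
  // Outside shadow mode, low-confidence analyses are filtered out.
  if (confidence < minConfidence) return 'drop';
  return 'notify';
}
```

With this in place, the event handlers could call notifyIncident only when routeAnalysis returns 'notify', and the mode could be read from a variable such as AI_ROLLOUT_MODE so that moving from Week 1 to Week 2 is a configuration change rather than a code change.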
Troubleshooting
Issue: No events detected
Symptoms:
- No AI events in logs
- No Slack notifications
- metricsBuffer is empty
Solutions:
Check if middleware is loaded:
console.log('AI monitoring loaded:', !!aiMonitoring);
Verify traffic is flowing:
curl http://localhost:3000/api/test
# Check logs for metric tracking
Lower thresholds temporarily:
ERROR_RATE_SPIKE: { threshold: 5 } // Lower from 10
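When no events appear, it helps to work out which setting is blocking them: too few samples in the window, or an error rate below the threshold. The sketch below re-implements a threshold rule in plain TypeScript for that purpose. It is an assumption about how threshold, window, and minSamples interact, based only on the comments in the Step 3 configuration; it is not FlexGate's actual detection code, and the names Sample, Rule, windowToMs, and shouldTrigger are hypothetical.

```typescript
// Hypothetical re-implementation of an error-rate rule, for debugging only.
interface Sample { timestamp: string; isError: boolean }
interface Rule { threshold: number; window: string; minSamples: number }

// Parse windows such as '30s', '5m', or '1h' into milliseconds.
function windowToMs(window: string): number {
  const match = /^(\d+)([smh])$/.exec(window);
  if (!match) throw new Error(`Unsupported window: ${window}`);
  const unitMs = { s: 1000, m: 60000, h: 3600000 }[match[2] as 's' | 'm' | 'h'];
  return Number(match[1]) * unitMs;
}

// Report whether the rule would fire, and why or why not.
function shouldTrigger(samples: Sample[], rule: Rule, now: Date): { fire: boolean; reason: string } {
  const cutoff = now.getTime() - windowToMs(rule.window);
  const recent = samples.filter(s => Date.parse(s.timestamp) >= cutoff);

  if (recent.length < rule.minSamples) {
    return { fire: false, reason: `only ${recent.length} samples in window, need ${rule.minSamples}` };
  }

  const errorRate = (recent.filter(s => s.isError).length / recent.length) * 100;
  if (errorRate < rule.threshold) {
    return { fire: false, reason: `error rate ${errorRate.toFixed(1)}% is below threshold ${rule.threshold}%` };
  }
  return { fire: true, reason: `error rate ${errorRate.toFixed(1)}% reached threshold ${rule.threshold}%` };
}
```

Feeding the contents of metricsBuffer through a check like this shows whether lowering the threshold would help, or whether the real problem is low traffic, in which case minSamples or the window is the setting to adjust.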
Issue: Claude API errors
Symptoms:
- [Claude] Analysis failed: 401 Unauthorized
- [Claude] Analysis failed: Rate limit exceeded
Solutions:
Check API key:
echo $ANTHROPIC_API_KEY # Should start with "sk-ant-api03-"
Verify the key has credits in the Anthropic console (https://console.anthropic.com).
Reduce analysis frequency:
budget: { maxAnalysesPerHour: 10 } // Lower from 20
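To see how an hourly cap and a daily cost cap behave together, a small guard in front of analyzeWithClaude can be sketched as follows. This mirrors the maxAnalysesPerHour and maxCostPerDay settings but is not the budget logic inside @flexgate/ai; the AnalysisBudget class and its method names are hypothetical, and the daily counter is assumed to reset at midnight UTC.

```typescript
// Hypothetical budget guard; not part of @flexgate/ai.
class AnalysisBudget {
  private readonly maxPerHour: number;
  private readonly maxCostPerDay: number;
  private callTimes: number[] = []; // epoch ms of analyses in the last hour
  private spentToday = 0;           // USD spent on the current UTC day
  private day = '';                 // UTC date the spend counter belongs to

  constructor(maxPerHour: number, maxCostPerDay: number) {
    this.maxPerHour = maxPerHour;
    this.maxCostPerDay = maxCostPerDay;
  }

  // Returns true and records the call if it fits within both limits.
  tryConsume(estimatedCostUsd: number, now: Date = new Date()): boolean {
    // Reset the daily spend when the UTC date changes.
    const today = now.toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today;
      this.spentToday = 0;
    }

    // Keep only calls made within the last hour (sliding window).
    const hourAgo = now.getTime() - 3600000;
    this.callTimes = this.callTimes.filter(t => t > hourAgo);

    if (this.callTimes.length >= this.maxPerHour) return false;
    if (this.spentToday + estimatedCostUsd > this.maxCostPerDay) return false;

    this.callTimes.push(now.getTime());
    this.spentToday += estimatedCostUsd;
    return true;
  }
}
```

A handler could call tryConsume with the estimated cost before each analysis and skip the Claude call when it returns false, which keeps a burst of events from exhausting the API rate limit.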
Issue: Slack notifications not arriving
Symptoms:
- No messages in #incidents channel
- [Slack] Not configured, skipping notification
Solutions:
Verify Slack app permissions:
- chat:write
- chat:write.public
Check bot token:
echo $SLACK_BOT_TOKEN # Should start with "xoxb-"
Test manually:
slack.client.chat.postMessage({ channel: '#incidents', text: 'Test message' });
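Another possible cause, if the manual test works but incident messages do not, is message size. Slack's Block Kit reference documents a 3000-character limit on the text of a section block, and a long analysis could exceed it. The sketch below shows one way to shorten the text before sending; the helper truncateForSlack is hypothetical, and whether this applies depends on how long your analyses are.

```typescript
// Documented limit for the text field of a Slack section block.
const SLACK_SECTION_TEXT_LIMIT = 3000;

// Hypothetical helper: shortens text to fit the limit and marks the cut.
// Assumes the limit is larger than the suffix itself.
function truncateForSlack(text: string, limit: number = SLACK_SECTION_TEXT_LIMIT): string {
  if (text.length <= limit) return text;
  const suffix = '\n… (truncated)';
  return text.slice(0, limit - suffix.length) + suffix;
}
```

Wrapping the output of formatAnalysis in a helper like this would keep the analysis section within the limit while still showing the leading root causes and actions.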
Monitoring & Metrics
Track These KPIs
Incident Response Metrics:
// Add to your monitoring dashboard
metrics.gauge('ai.mttr_minutes', calculatedMTTR);
metrics.gauge('ai.events_per_hour', eventsThisHour);
metrics.gauge('ai.confidence_avg', avgConfidence);
metrics.counter('ai.claude_api_calls');
metrics.gauge('ai.cost_per_day', dailyCost);
Weekly Review:
- MTTR: Should decrease 50-80% over 4 weeks
- False positives: Target <15%
- Cost: Should stay under your configured AI_MAX_COST_PER_DAY ($5/day by default)
- Engineer satisfaction: Survey team monthly
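The first two review numbers can be computed from a simple incident log. The sketch below assumes each incident records when it was detected, when it was resolved, and whether the on-call engineer marked it a false positive; the IncidentRecord shape and both function names are hypothetical.

```typescript
// Hypothetical incident record kept by the team for the weekly review.
interface IncidentRecord {
  detectedAt: string;     // ISO timestamp
  resolvedAt: string;     // ISO timestamp
  falsePositive: boolean; // marked during review
}

// MTTR is averaged over real incidents only; false positives are counted separately.
function weeklyReview(incidents: IncidentRecord[]): { mttrMinutes: number; falsePositiveRate: number } {
  const real = incidents.filter(i => !i.falsePositive);
  const totalMinutes = real.reduce(
    (sum, i) => sum + (Date.parse(i.resolvedAt) - Date.parse(i.detectedAt)) / 60000,
    0
  );
  return {
    mttrMinutes: real.length > 0 ? totalMinutes / real.length : 0,
    falsePositiveRate: incidents.length > 0 ? (incidents.length - real.length) / incidents.length : 0
  };
}

// Percentage improvement of the current MTTR over the baseline.
function mttrImprovement(baselineMinutes: number, currentMinutes: number): number {
  return ((baselineMinutes - currentMinutes) / baselineMinutes) * 100;
}
```

For example, a baseline of 52 minutes and a current MTTR of 9 minutes gives (52 - 9) / 52, roughly an 83% improvement, which falls just above the 50-80% target range.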
Example Dashboard
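# Dashboard query examples (PromQL)
# Error rate >10% (assumes a request counter named api_requests_total exists)
sum(rate(api_errors_total[5m])) / sum(rate(api_requests_total[5m])) > 0.10
# p99 latency >1s (assumes api_latency_seconds is a Prometheus histogram)
histogram_quantile(0.99, sum(rate(api_latency_seconds_bucket[5m])) by (le)) > 1.0
sum(ai_events_total) by (type) # Events by type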
sum(ai_analysis_cost_usd) # Daily AI cost
Success Stories
Company: TechCorp (E-commerce)
Before:
- MTTR: 52 minutes
- Incidents/month: 15
- Engineer on-call hours: 25/week
After (6 weeks):
- MTTR: 9 minutes (83% improvement)
- Incidents/month: 15 (same frequency)
- Engineer on-call hours: 6/week (76% reduction)
- False positives: 12%
- ROI: $312,000 annually
Quote:
"Claude catches things we'd miss at 3 AM. The rollback recommendation alone saved us $15,000 in one incident." - Sarah Chen, SRE Lead
Next Steps
Now that you have AI-assisted incident response running:
- Measure baseline MTTR - Track current incident response time
- Run for 2 weeks - Collect enough events to see how well the analyses match your traffic patterns
- Review analysis quality - Are recommendations helpful?
- Tune thresholds - Adjust confidence, error rates
- Expand to other use cases:
Getting Help
Documentation:
Support:
- 📧 Email: support@flexgate.io
- 💬 Discord: FlexGate Community
- 🐛 Issues: GitHub Issues
Playbook Version: 1.0.0
Last Updated: February 15, 2026
Tested With: FlexGate v2.0.0, Claude 3.5 Sonnet