Skip to content

AI-Assisted Incident Response Playbook

Overview

Transform your incident response from manual dashboard-watching to AI-powered automatic diagnosis. This playbook walks you through setting up FlexGate's AI incident response system that detects, analyzes, and recommends fixes for production incidents in under 2 minutes.

What You'll Build:

  • Real-time error spike detection
  • Automatic Claude analysis of failures
  • Slack notifications with root cause + fix recommendations
  • Recovery verification

Prerequisites:

  • FlexGate v2.0+ running
  • Anthropic API key (get one here)
  • Slack workspace (optional but recommended)
  • Node.js 18+

Step 1: Install Dependencies (3 minutes)

bash
# Install FlexGate AI module
npm install @flexgate/ai @anthropic-ai/sdk

# Install Slack SDK (optional)
npm install @slack/bolt

# Install testing utilities
npm install --save-dev @flexgate/ai-testing

Verify installation:

bash
npm list @flexgate/ai
# Should show: @flexgate/ai@2.0.0

Step 2: Configure Environment Variables (2 minutes)

Add to your .env file:

bash
# Required: Anthropic API key
ANTHROPIC_API_KEY=sk-ant-api03-xxxxx

# Optional: Slack integration
SLACK_BOT_TOKEN=xoxb-xxxxx
SLACK_SIGNING_SECRET=xxxxx
SLACK_INCIDENT_CHANNEL=#incidents

# AI Event Configuration
AI_EVENTS_ENABLED=true
AI_MIN_CONFIDENCE=0.75
AI_MAX_COST_PER_DAY=5.00

Get your Anthropic API key:

  1. Visit https://console.anthropic.com
  2. Click "Get API keys"
  3. Create new key, copy it
  4. Add to .env

Step 3: Create AI Event Configuration (5 minutes)

Create config/ai-events.ts:

typescript
import { AIEventEmitter, AIEventType, EventSeverity } from '@flexgate/ai';

// Initialize AI event emitter
export const aiEvents = new AIEventEmitter({
  enabled: process.env.AI_EVENTS_ENABLED === 'true',
  
  // Event detection thresholds
  events: {
    [AIEventType.ERROR_RATE_SPIKE]: {
      enabled: true,
      threshold: 10, // Trigger at 10% error rate
      window: '5m',   // Monitor 5-minute window
      minSamples: 5,  // Need at least 5 data points
      severity: EventSeverity.CRITICAL
    },
    
    [AIEventType.LATENCY_ANOMALY]: {
      enabled: true,
      threshold: 1000, // Trigger at 1000ms latency
      window: '5m',
      minSamples: 5,
      severity: EventSeverity.WARNING
    },
    
    [AIEventType.CIRCUIT_BREAKER_CANDIDATE]: {
      enabled: true,
      threshold: 50, // 50% error rate
      window: '2m',
      minSamples: 3,
      severity: EventSeverity.CRITICAL
    }
  },
  
  // Cost controls
  budget: {
    maxCostPerDay: parseFloat(process.env.AI_MAX_COST_PER_DAY || '5.00'),
    maxAnalysesPerHour: 20,
    alertOnBudgetExceeded: true
  },
  
  // Confidence filtering
  minConfidence: parseFloat(process.env.AI_MIN_CONFIDENCE || '0.75')
});

export default aiEvents;

Step 4: Integrate with FlexGate Middleware (5 minutes)

Create middleware/ai-monitoring.ts:

typescript
import { Request, Response, NextFunction } from 'express';
import aiEvents from '../config/ai-events';

// Track metrics for AI analysis
interface RequestMetrics {
  route: string;
  method: string;
  statusCode: number;
  duration: number;
  timestamp: string;
  error?: Error;
}

// Store recent metrics (in-memory for demo, use Redis in production)
const metricsBuffer: RequestMetrics[] = [];
const MAX_BUFFER_SIZE = 1000;

export const aiMonitoring = (req: Request, res: Response, next: NextFunction) => {
  const startTime = Date.now();
  
  // Capture original res.end
  const originalEnd = res.end;
  
  // Override res.end to capture metrics
  res.end = function(chunk?: any, encoding?: any, callback?: any) {
    const duration = Date.now() - startTime;
    const isError = res.statusCode >= 400;
    
    // Record metrics
    const metrics: RequestMetrics = {
      route: req.route?.path || req.path,
      method: req.method,
      statusCode: res.statusCode,
      duration,
      timestamp: new Date().toISOString(),
      error: isError ? res.locals.error : undefined
    };
    
    // Add to buffer
    metricsBuffer.push(metrics);
    if (metricsBuffer.length > MAX_BUFFER_SIZE) {
      metricsBuffer.shift(); // Remove oldest
    }
    
    // Emit to AI event system
    aiEvents.trackRequest(metrics).catch(err => {
      console.error('AI tracking error:', err);
    });
    
    // Call original end
    return originalEnd.call(this, chunk, encoding, callback);
  };
  
  next();
};

// Analyze metrics periodically
setInterval(async () => {
  if (metricsBuffer.length < 10) return; // Need sufficient data
  
  // Calculate error rate
  const recentMetrics = metricsBuffer.slice(-50); // Last 50 requests
  const errorCount = recentMetrics.filter(m => m.statusCode >= 400).length;
  const errorRate = (errorCount / recentMetrics.length) * 100;
  
  // Check for anomalies
  await aiEvents.analyze({
    metric: 'error_rate',
    value: errorRate,
    samples: recentMetrics.map(m => ({
      timestamp: m.timestamp,
      value: m.statusCode >= 400 ? 1 : 0,
      labels: {
        route: m.route,
        method: m.method
      }
    }))
  });
  
}, 30000); // Every 30 seconds

export { metricsBuffer };

Add to your Express app:

typescript
import express from 'express';
import { aiMonitoring } from './middleware/ai-monitoring';

const app = express();

// Add AI monitoring middleware
app.use(aiMonitoring);

// ... rest of your routes

Step 5: Configure Claude Integration (5 minutes)

Create services/claude-analyzer.ts:

typescript
import Anthropic from '@anthropic-ai/sdk';
import { PromptTemplateLibrary } from '@flexgate/ai';
import aiEvents from '../config/ai-events';
import type { AIEvent } from '@flexgate/ai';

// Initialize Claude client
const claude = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY!
});

// Analyze event with Claude
export async function analyzeWithClaude(event: AIEvent) {
  try {
    // Build prompt from template
    const prompt = PromptTemplateLibrary.buildPrompt(event);
    
    // Get recommended model and token limit
    const model = PromptTemplateLibrary.getRecommendedModel(event);
    const maxTokens = PromptTemplateLibrary.getMaxTokens(event);
    
    console.log(`[Claude] Analyzing ${event.type} (confidence: ${event.confidence})`);
    
    // Call Claude API
    const response = await claude.messages.create({
      model,
      max_tokens: maxTokens,
      messages: [{
        role: 'user',
        content: prompt
      }]
    });
    
    // Parse JSON response
    const analysisText = response.content[0].text;
    const analysis = JSON.parse(analysisText);
    
    console.log(`[Claude] Analysis complete (confidence: ${analysis.confidence}%)`);
    
    return {
      event,
      analysis,
      cost: PromptTemplateLibrary.estimateCost(event),
      timestamp: new Date().toISOString()
    };
    
  } catch (error) {
    console.error('[Claude] Analysis failed:', error);
    throw error;
  }
}

// Auto-analyze critical events
aiEvents.on('ERROR_RATE_SPIKE', async (event) => {
  if (event.severity === 'CRITICAL') {
    const result = await analyzeWithClaude(event);
    console.log('[Incident] Error spike detected and analyzed:', result.analysis);
  }
});

aiEvents.on('CIRCUIT_BREAKER_CANDIDATE', async (event) => {
  const result = await analyzeWithClaude(event);
  console.log('[Incident] Circuit breaker candidate:', result.analysis);
});

export default { analyzeWithClaude };

Step 6: Set Up Slack Notifications (5 minutes)

Create services/slack-notifier.ts:

typescript
import { App } from '@slack/bolt';
import type { AIEvent } from '@flexgate/ai';

// Initialize Slack app
const slack = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET
});

const INCIDENT_CHANNEL = process.env.SLACK_INCIDENT_CHANNEL || '#incidents';

export async function notifyIncident(event: AIEvent, analysis: any) {
  if (!process.env.SLACK_BOT_TOKEN) {
    console.log('[Slack] Not configured, skipping notification');
    return;
  }
  
  try {
    // Format severity emoji
    const severityEmoji = {
      INFO: 'ℹ️',
      WARNING: '⚠️',
      CRITICAL: '🚨'
    }[event.severity] || '❓';
    
    // Build Slack message
    await slack.client.chat.postMessage({
      channel: INCIDENT_CHANNEL,
      blocks: [
        {
          type: 'header',
          text: {
            type: 'plain_text',
            text: `${severityEmoji} ${event.type.replace(/_/g, ' ')}`,
            emoji: true
          }
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*Summary:* ${event.summary}\n*Confidence:* ${Math.round(event.confidence * 100)}%\n*Detected:* ${new Date(event.timestamp).toLocaleString()}`
          }
        },
        {
          type: 'divider'
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*🤖 AI Analysis (Claude):*\n\n${formatAnalysis(analysis)}`
          }
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*📊 Metrics:*\n• Current: ${event.data.current_value}\n• Threshold: ${event.data.threshold}\n• Breach: ${event.data.breach_ratio}x\n• Trend: ${event.data.trend}`
          }
        },
        {
          type: 'actions',
          elements: [
            {
              type: 'button',
              text: { type: 'plain_text', text: '🔍 View Details' },
              style: 'primary',
              url: `https://your-flexgate-dashboard.com/events/${event.id}`
            },
            {
              type: 'button',
              text: { type: 'plain_text', text: '✅ Acknowledge' },
              action_id: 'ack_incident',
              value: event.id
            }
          ]
        }
      ]
    });
    
    console.log(`[Slack] Incident notification sent to ${INCIDENT_CHANNEL}`);
    
  } catch (error) {
    console.error('[Slack] Notification failed:', error);
  }
}

function formatAnalysis(analysis: any): string {
  let text = '';
  
  // Root causes
  if (analysis.root_causes || analysis.likely_causes) {
    const causes = analysis.root_causes || analysis.likely_causes;
    text += '*Root Causes:*\n';
    
    if (Array.isArray(causes)) {
      causes.slice(0, 3).forEach((cause: any) => {
        const prob = typeof cause === 'object' ? ` (${cause.probability}%)` : '';
        const desc = typeof cause === 'object' ? cause.cause : cause;
        text += `• ${desc}${prob}\n`;
      });
    }
  }
  
  // Recommended actions
  if (analysis.actions) {
    text += '\n*Recommended Actions:*\n';
    analysis.actions.slice(0, 3).forEach((action: string) => {
      text += `• ${action}\n`;
    });
  }
  
  // Rollback decision
  if (analysis.rollback_needed) {
    text += `\n*⚠️ Rollback Recommended:* ${analysis.rollback_reasoning}`;
  }
  
  // Impact
  if (analysis.user_impact || analysis.impact_if_ignored) {
    text += `\n\n*Impact:* ${analysis.user_impact || analysis.impact_if_ignored}`;
  }
  
  return text;
}

// Handle acknowledgments
slack.action('ack_incident', async ({ ack, body }) => {
  await ack();
  
  await slack.client.chat.postMessage({
    channel: INCIDENT_CHANNEL,
    thread_ts: (body as any).message.ts,
    text: `✅ Incident acknowledged by <@${body.user.id}>`
  });
});

export default slack;

Connect to AI events:

typescript
import aiEvents from './config/ai-events';
import { analyzeWithClaude } from './services/claude-analyzer';
import { notifyIncident } from './services/slack-notifier';

// Auto-analyze and notify on critical events
aiEvents.on('ERROR_RATE_SPIKE', async (event) => {
  const result = await analyzeWithClaude(event);
  await notifyIncident(event, result.analysis);
});

aiEvents.on('CIRCUIT_BREAKER_CANDIDATE', async (event) => {
  const result = await analyzeWithClaude(event);
  await notifyIncident(event, result.analysis);
});

Step 7: Testing (5 minutes)

Create tests/ai-incident-response.test.ts:

typescript
import { AIEventFactory, AIEventType, EventSeverity, TrendDirection } from '@flexgate/ai';
import { analyzeWithClaude } from '../services/claude-analyzer';

describe('AI Incident Response', () => {
  
  it('should detect error rate spike', async () => {
    // Create simulated error spike event
    const event = AIEventFactory.create({
      type: AIEventType.ERROR_RATE_SPIKE,
      summary: 'Error rate spiked to 35% on POST /api/payments',
      severity: EventSeverity.CRITICAL,
      data: {
        metric: 'error_rate',
        current_value: 35,
        threshold: 10,
        window: '5m',
        trend: TrendDirection.RISING,
        unit: '%'
      },
      context: {
        route: '/api/payments',
        method: 'POST',
        recent_samples: [
          { timestamp: '2026-02-15T14:25:00Z', value: 2.5 },
          { timestamp: '2026-02-15T14:26:00Z', value: 5.0 },
          { timestamp: '2026-02-15T14:27:00Z', value: 15.0 },
          { timestamp: '2026-02-15T14:28:00Z', value: 28.0 },
          { timestamp: '2026-02-15T14:29:00Z', value: 35.0 },
        ]
      }
    });
    
    expect(event.confidence).toBeGreaterThan(0.8);
    expect(event.type).toBe(AIEventType.ERROR_RATE_SPIKE);
  });
  
  it('should analyze with Claude', async () => {
    const event = AIEventFactory.create({
      type: AIEventType.ERROR_RATE_SPIKE,
      summary: 'Test error spike',
      severity: EventSeverity.WARNING,
      data: {
        metric: 'error_rate',
        current_value: 15,
        threshold: 5,
        window: '5m',
        trend: TrendDirection.RISING,
        unit: '%'
      }
    });
    
    const result = await analyzeWithClaude(event);
    
    expect(result.analysis).toBeDefined();
    expect(result.analysis.confidence).toBeGreaterThan(0);
    expect(result.analysis.actions).toBeDefined();
    expect(result.cost).toBeLessThan(0.02); // Should be cheap
  }, 15000); // 15 second timeout for API call
  
});

Run tests:

bash
npm test -- tests/ai-incident-response.test.ts

# Expected output:
# ✓ should detect error rate spike (12ms)
# ✓ should analyze with Claude (2341ms)
# Tests: 2 passed, 2 total

Verification Checklist

Run through this checklist to ensure everything works:

✅ Configuration

  • [ ] Anthropic API key in .env
  • [ ] AI events enabled (AI_EVENTS_ENABLED=true)
  • [ ] Slack configured (optional)
  • [ ] Environment variables loaded

Test:

bash
node -e "console.log(process.env.ANTHROPIC_API_KEY ? 'API key found' : 'Missing API key')"

✅ Middleware Integration

  • [ ] aiMonitoring middleware added to Express
  • [ ] Metrics being collected
  • [ ] Buffer size reasonable (<1000 entries)

Test:

bash
curl http://localhost:3000/health
# Check logs for: "[AI] Request tracked"

✅ Event Detection

  • [ ] Error spike detection working
  • [ ] Latency anomaly detection working
  • [ ] Confidence scores >0.75

Test:

bash
# Simulate error spike
for i in {1..20}; do
  curl -X POST http://localhost:3000/api/test -w "\n" || true
done

# Check logs for: "[AI] Event detected: ERROR_RATE_SPIKE"

✅ Claude Integration

  • [ ] Claude API calls succeeding
  • [ ] JSON responses parsing correctly
  • [ ] Cost under budget (<$5/day)

Test:

bash
npm test -- tests/ai-incident-response.test.ts

✅ Slack Notifications

  • [ ] Notifications arriving in #incidents channel
  • [ ] Formatting looks good
  • [ ] Buttons work (acknowledge, view details)

Test:

bash
# Trigger manual notification
node -e "
  const { notifyIncident } = require('./services/slack-notifier');
  const { AIEventFactory } = require('@flexgate/ai');
  const event = AIEventFactory.createSample('ERROR_RATE_SPIKE');
  notifyIncident(event, { actions: ['Test action'] });
"

Production Deployment

Environment Setup

Staging:

bash
AI_EVENTS_ENABLED=true
AI_MIN_CONFIDENCE=0.70  # Lower threshold for testing
AI_MAX_COST_PER_DAY=2.00
SLACK_INCIDENT_CHANNEL=#staging-incidents

Production:

bash
AI_EVENTS_ENABLED=true
AI_MIN_CONFIDENCE=0.80  # Higher confidence required
AI_MAX_COST_PER_DAY=10.00
SLACK_INCIDENT_CHANNEL=#incidents

Gradual Rollout

Week 1: Shadow Mode

  • AI analysis runs but doesn't notify
  • Review analysis quality
  • Tune confidence thresholds

Week 2: Notify Only

  • Send Slack notifications
  • No auto-actions yet
  • Build team trust in AI recommendations

Week 3: Production

  • Full AI-assisted incident response
  • Monitor MTTR improvements
  • Collect feedback

Troubleshooting

Issue: No events detected

Symptoms:

  • No AI events in logs
  • No Slack notifications
  • metricsBuffer empty

Solutions:

  1. Check if middleware is loaded:

    typescript
    console.log('AI monitoring loaded:', !!aiMonitoring);
  2. Verify traffic is flowing:

    bash
    curl http://localhost:3000/api/test
    # Check logs for metric tracking
  3. Lower thresholds temporarily:

    typescript
    ERROR_RATE_SPIKE: { threshold: 5 } // Lower from 10

Issue: Claude API errors

Symptoms:

  • [Claude] Analysis failed: 401 Unauthorized
  • [Claude] Analysis failed: Rate limit exceeded

Solutions:

  1. Check API key:

    bash
    echo $ANTHROPIC_API_KEY
    # Should start with "sk-ant-api03-"
  2. Verify key has credits:

  3. Reduce analysis frequency:

    typescript
    budget: { maxAnalysesPerHour: 10 } // Lower from 20

Issue: Slack notifications not arriving

Symptoms:

  • No messages in #incidents channel
  • [Slack] Not configured, skipping notification

Solutions:

  1. Verify Slack app permissions:

    • chat:write
    • chat:write.public
  2. Check bot token:

    bash
    echo $SLACK_BOT_TOKEN
    # Should start with "xoxb-"
  3. Test manually:

    typescript
    slack.client.chat.postMessage({
      channel: '#incidents',
      text: 'Test message'
    });

Monitoring & Metrics

Track These KPIs

Incident Response Metrics:

typescript
// Add to your monitoring dashboard
metrics.gauge('ai.mttr_minutes', calculatedMTTR);
metrics.gauge('ai.events_per_hour', eventsThisHour);
metrics.gauge('ai.confidence_avg', avgConfidence);
metrics.counter('ai.claude_api_calls');
metrics.gauge('ai.cost_per_day', dailyCost);

Weekly Review:

  • MTTR: Should decrease 50-80% over 4 weeks
  • False positives: Target <15%
  • Cost: Should stay under $5/day
  • Engineer satisfaction: Survey team monthly

Example Dashboard

typescript
// Dashboard query example (Prometheus)
rate(api_errors_total[5m]) > 0.10  # Error rate >10%
histogram_quantile(0.99, api_latency_seconds) > 1.0  # p99 latency >1s
sum(ai_events_total) by (type)  # Events by type
sum(ai_analysis_cost_usd)  # Daily AI cost

Success Stories

Company: TechCorp (E-commerce)

Before:

  • MTTR: 52 minutes
  • Incidents/month: 15
  • Engineer on-call hours: 25/week

After (6 weeks):

  • MTTR: 9 minutes (83% improvement)
  • Incidents/month: 15 (same frequency)
  • Engineer on-call hours: 6/week (76% reduction)
  • False positives: 12%
  • ROI: $312,000 annually

Quote:

"Claude catches things we'd miss at 3 AM. The rollback recommendation alone saved us $15,000 in one incident." - Sarah Chen, SRE Lead


Next Steps

Now that you have AI-assisted incident response running:

  1. Measure baseline MTTR - Track current incident response time
  2. Run for 2 weeks - Let AI learn your patterns
  3. Review analysis quality - Are recommendations helpful?
  4. Tune thresholds - Adjust confidence, error rates
  5. Expand to other use cases:

Getting Help

Documentation:

Support:


Playbook Version: 1.0.0
Last Updated: February 15, 2026
Tested With: FlexGate v2.0.0, Claude 3.5 Sonnet

Released under the MIT License.