Autonomous Service Recovery Playbook

Overview

Stop waiting for engineers to manually recover from common failures. This playbook shows you how to set up FlexGate's AI-powered autonomous recovery system that detects failures, generates recovery plans with Claude, and executes fixes automatically with human-in-loop safety controls.

What You'll Build:

Failure pattern detection
AI-powered recovery plan generation
Automated execution with safety guardrails
Slack-based approval workflow
Recovery monitoring and rollback

Prerequisites:

FlexGate v2.0+ with AI module installed
Anthropic API key
Slack workspace with admin access
Kubernetes or Docker deployment (for automated actions)
Node.js 18+

⚠️ Safety First

This playbook implements AUTONOMOUS ACTIONS on production infrastructure.

Safety Guardrails Built-In:

✅ Human-in-loop - Critical actions require Slack approval (5 min timeout)
✅ Confidence thresholds - Only execute if AI confidence >80%
✅ Rate limiting - Max 3 auto-recoveries per hour
✅ Rollback capability - One-click undo for any action
✅ Audit logging - Every action logged with reasoning
✅ Dry-run mode - Test without making changes

Never Auto-Execute (Always Require Approval):

Pod/container restarts
Database migrations
Traffic rerouting
Security changes
Configuration with deployment

Step 1: Define Recovery Runbooks (10 minutes)

Create config/recovery-runbooks.ts:

typescript

import { AIEventType, EventSeverity } from '@flexgate/ai';

export interface RecoveryAction {
  type: 'SCALE_RESOURCE' | 'CIRCUIT_BREAKER' | 'RESTART_POD' | 'CACHE_CLEAR' | 'RATE_LIMIT' | 'FAILOVER';
  resource?: string;
  operation?: string;
  parameters: Record<string, any>;
  duration?: number; // seconds
  gradual?: boolean;
}

export interface SafetyConfig {
  requiresApproval: boolean;
  maxExecutionsPerHour: number;
  rollbackAfterMinutes?: number;
  dryRun?: boolean;
}

export interface RecoveryRunbook {
  name: string;
  description: string;
  trigger: {
    eventType: AIEventType;
    conditions: Record<string, any>;
  };
  actions: RecoveryAction[];
  safety: SafetyConfig;
  estimatedRecoveryTime: number; // minutes
}

export const RECOVERY_RUNBOOKS: RecoveryRunbook[] = [
  {
    name: 'scale-database-connections',
    description: 'Scale database connection pool when exhausted',
    trigger: {
      eventType: AIEventType.LATENCY_ANOMALY,
      conditions: {
        cause: 'DATABASE',
        confidence: { min: 0.80 },
        severity: EventSeverity.WARNING
      }
    },
    actions: [
      {
        type: 'SCALE_RESOURCE',
        resource: 'DB_POOL_SIZE',
        parameters: {
          from: 20,
          to: 50,
          increment: 10,
          intervalSeconds: 30
        },
        gradual: true
      }
    ],
    safety: {
      requiresApproval: false, // Low risk
      maxExecutionsPerHour: 3,
      rollbackAfterMinutes: 30
    },
    estimatedRecoveryTime: 5
  },
  
  {
    name: 'open-circuit-breaker',
    description: 'Open circuit breaker for failing upstream service',
    trigger: {
      eventType: AIEventType.CIRCUIT_BREAKER_CANDIDATE,
      conditions: {
        confidence: { min: 0.85 },
        errorRate: { min: 50 }
      }
    },
    actions: [
      {
        type: 'CIRCUIT_BREAKER',
        operation: 'OPEN',
        parameters: {
          service: '{{upstream}}',
          duration: 60,
          halfOpenRequests: 3
        },
        duration: 60
      }
    ],
    safety: {
      requiresApproval: false,
      maxExecutionsPerHour: 5,
      rollbackAfterMinutes: 5
    },
    estimatedRecoveryTime: 2
  },
  
  {
    name: 'clear-cache-on-stale-data',
    description: 'Clear cache when serving stale data',
    trigger: {
      eventType: AIEventType.ERROR_RATE_SPIKE,
      conditions: {
        confidence: { min: 0.75 },
        errorCategory: '5xx'
      }
    },
    actions: [
      {
        type: 'CACHE_CLEAR',
        parameters: {
          pattern: '{{route}}:*',
          graceful: true
        }
      }
    ],
    safety: {
      requiresApproval: false,
      maxExecutionsPerHour: 5
    },
    estimatedRecoveryTime: 1
  },
  
  {
    name: 'restart-pod-memory-leak',
    description: 'Restart pod when memory leak detected',
    trigger: {
      eventType: AIEventType.CAPACITY_WARNING,
      conditions: {
        resourceType: 'MEMORY',
        confidence: { min: 0.80 },
        utilization: { min: 90 }
      }
    },
    actions: [
      {
        type: 'RESTART_POD',
        parameters: {
          deployment: 'api-gateway',
          graceful: true,
          drainTimeout: 30
        }
      }
    ],
    safety: {
      requiresApproval: true, // HIGH RISK - requires approval
      maxExecutionsPerHour: 2,
      dryRun: false
    },
    estimatedRecoveryTime: 3
  },
  
  {
    name: 'apply-rate-limit-abuse',
    description: 'Apply rate limit when abuse detected',
    trigger: {
      eventType: AIEventType.RATE_LIMIT_BREACH,
      conditions: {
        trafficType: 'ABUSE',
        confidence: { min: 0.75 }
      }
    },
    actions: [
      {
        type: 'RATE_LIMIT',
        parameters: {
          target: '{{client}}',
          limit: 100,
          window: 60,
          duration: 1800 // 30 minutes
        }
      }
    ],
    safety: {
      requiresApproval: false,
      maxExecutionsPerHour: 10
    },
    estimatedRecoveryTime: 1
  },
  
  {
    name: 'failover-upstream-degraded',
    description: 'Failover to backup service when primary degraded',
    trigger: {
      eventType: AIEventType.UPSTREAM_DEGRADATION,
      conditions: {
        confidence: { min: 0.85 },
        severity: EventSeverity.CRITICAL
      }
    },
    actions: [
      {
        type: 'FAILOVER',
        parameters: {
          from: '{{upstream}}',
          to: '{{backup_service}}',
          healthCheckInterval: 10,
          autoRevertOnRecovery: true
        }
      }
    ],
    safety: {
      requiresApproval: true, // HIGH RISK
      maxExecutionsPerHour: 2
    },
    estimatedRecoveryTime: 5
  }
];

// Find matching runbook for event
export function findRunbook(event: any): RecoveryRunbook | null {
  for (const runbook of RECOVERY_RUNBOOKS) {
    if (runbook.trigger.eventType !== event.type) continue;
    
    const conditions = runbook.trigger.conditions;
    let matches = true;
    
    // Check confidence
    if (conditions.confidence?.min && event.confidence < conditions.confidence.min) {
      matches = false;
    }
    
    // Check severity
    if (conditions.severity && event.severity !== conditions.severity) {
      matches = false;
    }
    
    // Check other conditions (cause, errorRate, etc.)
    for (const [key, value] of Object.entries(conditions)) {
      if (key === 'confidence' || key === 'severity') continue;
      
      if (typeof value === 'object' && 'min' in value) {
        if ((event.data[key] || event.context[key]) < value.min) {
          matches = false;
        }
      } else {
        if (event.data[key] !== value && event.context[key] !== value) {
          matches = false;
        }
      }
    }
    
    if (matches) return runbook;
  }
  
  return null;
}

export default RECOVERY_RUNBOOKS;

Step 2: Implement Auto-Recovery Engine (10 minutes)

Create services/auto-recovery.ts:

typescript

import { AIEvent, AIEventType } from '@flexgate/ai';
import { RecoveryRunbook, RecoveryAction, findRunbook } from '../config/recovery-runbooks';
import aiEvents from '../config/ai-events';

interface RecoveryPlan {
  id: string;
  event: AIEvent;
  runbook: RecoveryRunbook;
  actions: RecoveryAction[];
  requiresApproval: boolean;
  confidence: number;
  estimatedRecoveryTime: number;
  rollbackPlan?: RecoveryPlan;
  timestamp: string;
}

interface RecoveryExecution {
  planId: string;
  status: 'PENDING' | 'APPROVED' | 'REJECTED' | 'EXECUTING' | 'SUCCESS' | 'FAILED' | 'ROLLED_BACK';
  startTime?: string;
  endTime?: string;
  appliedActions: string[];
  error?: string;
}

class AutoRecoveryEngine {
  private executions = new Map<string, RecoveryExecution>();
  private executionCounts = new Map<string, number[]>(); // runbook -> timestamps
  
  async generateRecoveryPlan(event: AIEvent): Promise<RecoveryPlan | null> {
    const runbook = findRunbook(event);
    if (!runbook) {
      console.log(`[Auto-Recovery] No runbook found for ${event.type}`);
      return null;
    }
    
    // Check rate limiting
    if (!this.checkRateLimit(runbook)) {
      console.log(`[Auto-Recovery] Rate limit exceeded for ${runbook.name}`);
      return null;
    }
    
    // Substitute variables in actions
    const actions = runbook.actions.map(action => ({
      ...action,
      parameters: this.substituteVariables(action.parameters, event)
    }));
    
    const plan: RecoveryPlan = {
      id: `recovery-${Date.now()}`,
      event,
      runbook,
      actions,
      requiresApproval: runbook.safety.requiresApproval,
      confidence: event.confidence,
      estimatedRecoveryTime: runbook.estimatedRecoveryTime,
      timestamp: new Date().toISOString()
    };
    
    // Generate rollback plan
    plan.rollbackPlan = this.generateRollbackPlan(plan);
    
    console.log(`[Auto-Recovery] Generated plan: ${runbook.name} (approval: ${plan.requiresApproval})`);
    
    return plan;
  }
  
  async execute(plan: RecoveryPlan, approved = false): Promise<RecoveryExecution> {
    // Check if approval required
    if (plan.requiresApproval && !approved) {
      throw new Error('Recovery plan requires approval');
    }
    
    // Check dry-run mode
    if (plan.runbook.safety.dryRun) {
      console.log('[Auto-Recovery] DRY RUN MODE - No actions executed');
      return this.dryRunExecution(plan);
    }
    
    const execution: RecoveryExecution = {
      planId: plan.id,
      status: 'EXECUTING',
      startTime: new Date().toISOString(),
      appliedActions: []
    };
    
    this.executions.set(plan.id, execution);
    
    try {
      for (const action of plan.actions) {
        console.log(`[Auto-Recovery] Executing: ${action.type}`);
        await this.executeAction(action, plan.event);
        execution.appliedActions.push(action.type);
      }
      
      execution.status = 'SUCCESS';
      execution.endTime = new Date().toISOString();
      
      // Record execution for rate limiting
      this.recordExecution(plan.runbook.name);
      
      // Schedule rollback if configured
      if (plan.runbook.safety.rollbackAfterMinutes) {
        this.scheduleRollback(plan, plan.runbook.safety.rollbackAfterMinutes);
      }
      
      console.log(`[Auto-Recovery] Success: ${plan.runbook.name}`);
      
    } catch (error) {
      execution.status = 'FAILED';
      execution.endTime = new Date().toISOString();
      execution.error = error.message;
      
      console.error(`[Auto-Recovery] Failed: ${plan.runbook.name}`, error);
      
      // Auto-rollback on failure
      if (plan.rollbackPlan) {
        await this.execute(plan.rollbackPlan, true);
        execution.status = 'ROLLED_BACK';
      }
    }
    
    return execution;
  }
  
  private async executeAction(action: RecoveryAction, event: AIEvent): Promise<void> {
    switch (action.type) {
      case 'SCALE_RESOURCE':
        await this.scaleResource(action.parameters);
        break;
      
      case 'CIRCUIT_BREAKER':
        await this.circuitBreaker(action.parameters);
        break;
      
      case 'RESTART_POD':
        await this.restartPod(action.parameters);
        break;
      
      case 'CACHE_CLEAR':
        await this.clearCache(action.parameters);
        break;
      
      case 'RATE_LIMIT':
        await this.applyRateLimit(action.parameters);
        break;
      
      case 'FAILOVER':
        await this.failover(action.parameters);
        break;
      
      default:
        throw new Error(`Unknown action type: ${action.type}`);
    }
  }
  
  private async scaleResource(params: any): Promise<void> {
    console.log(`[Action] Scaling ${params.resource}: ${params.from} → ${params.to}`);
    
    if (params.gradual) {
      const steps = Math.ceil((params.to - params.from) / params.increment);
      for (let i = 1; i <= steps; i++) {
        const newValue = Math.min(params.from + i * params.increment, params.to);
        await this.updateEnvironment(params.resource, newValue);
        await this.sleep(params.intervalSeconds * 1000);
      }
    } else {
      await this.updateEnvironment(params.resource, params.to);
    }
  }
  
  private async circuitBreaker(params: any): Promise<void> {
    console.log(`[Action] Circuit breaker ${params.operation}: ${params.service}`);
    
    // Implementation depends on your circuit breaker library
    // Example with Opossum:
    // const breaker = circuitBreakers.get(params.service);
    // if (params.operation === 'OPEN') {
    //   breaker.open();
    //   setTimeout(() => breaker.close(), params.duration * 1000);
    // }
  }
  
  private async restartPod(params: any): Promise<void> {
    console.log(`[Action] Restarting pod: ${params.deployment}`);
    
    if (params.graceful) {
      // Kubernetes rolling restart
      const { exec } = require('child_process');
      await new Promise((resolve, reject) => {
        exec(
          `kubectl rollout restart deployment/${params.deployment}`,
          (error, stdout, stderr) => {
            if (error) reject(error);
            else resolve(stdout);
          }
        );
      });
    }
  }
  
  private async clearCache(params: any): Promise<void> {
    console.log(`[Action] Clearing cache: ${params.pattern}`);
    
    // Implementation depends on your cache system
    // Example with Redis:
    // const redis = await getRedisClient();
    // const keys = await redis.keys(params.pattern);
    // if (keys.length > 0) {
    //   await redis.del(...keys);
    // }
  }
  
  private async applyRateLimit(params: any): Promise<void> {
    console.log(`[Action] Rate limit: ${params.target} → ${params.limit} req/${params.window}s`);
    
    // Implementation depends on your rate limiter
    // Example:
    // await rateLimiter.setLimit(params.target, {
    //   limit: params.limit,
    //   window: params.window,
    //   duration: params.duration
    // });
  }
  
  private async failover(params: any): Promise<void> {
    console.log(`[Action] Failover: ${params.from} → ${params.to}`);
    
    // Update routing configuration
    // await routingConfig.setUpstream(params.from, params.to);
    
    // Set up health check for auto-revert
    if (params.autoRevertOnRecovery) {
      // this.monitorUpstreamHealth(params.from, params.to);
    }
  }
  
  private async updateEnvironment(key: string, value: any): Promise<void> {
    // Kubernetes ConfigMap update
    // const { exec } = require('child_process');
    // await exec(`kubectl set env deployment/api-gateway ${key}=${value}`);
    console.log(`[Config] Set ${key}=${value}`);
  }
  
  private substituteVariables(params: any, event: AIEvent): any {
    const result = { ...params };
    
    for (const [key, value] of Object.entries(result)) {
      if (typeof value === 'string' && value.includes('{{')) {
        result[key] = value
          .replace('{{upstream}}', event.context.upstream || '')
          .replace('{{route}}', event.context.route || '')
          .replace('{{client}}', event.context.client || '')
          .replace('{{backup_service}}', event.context.backup_service || '');
      }
    }
    
    return result;
  }
  
  private generateRollbackPlan(plan: RecoveryPlan): RecoveryPlan | undefined {
    const rollbackActions: RecoveryAction[] = [];
    
    for (const action of plan.actions) {
      switch (action.type) {
        case 'SCALE_RESOURCE':
          rollbackActions.push({
            ...action,
            parameters: {
              ...action.parameters,
              from: action.parameters.to,
              to: action.parameters.from
            }
          });
          break;
        
        case 'CIRCUIT_BREAKER':
          if (action.parameters.operation === 'OPEN') {
            rollbackActions.push({
              ...action,
              parameters: { ...action.parameters, operation: 'CLOSE' }
            });
          }
          break;
        
        case 'FAILOVER':
          rollbackActions.push({
            ...action,
            parameters: {
              from: action.parameters.to,
              to: action.parameters.from
            }
          });
          break;
      }
    }
    
    if (rollbackActions.length === 0) return undefined;
    
    return {
      ...plan,
      id: `rollback-${plan.id}`,
      actions: rollbackActions,
      requiresApproval: false // Rollbacks don't need approval
    };
  }
  
  private checkRateLimit(runbook: RecoveryRunbook): boolean {
    const executions = this.executionCounts.get(runbook.name) || [];
    const hourAgo = Date.now() - 60 * 60 * 1000;
    const recentExecutions = executions.filter(ts => ts > hourAgo);
    
    return recentExecutions.length < runbook.safety.maxExecutionsPerHour;
  }
  
  private recordExecution(runbookName: string): void {
    const executions = this.executionCounts.get(runbookName) || [];
    executions.push(Date.now());
    this.executionCounts.set(runbookName, executions);
  }
  
  private scheduleRollback(plan: RecoveryPlan, minutes: number): void {
    setTimeout(async () => {
      const execution = this.executions.get(plan.id);
      if (execution?.status === 'SUCCESS' && plan.rollbackPlan) {
        console.log(`[Auto-Recovery] Auto-rollback scheduled: ${plan.runbook.name}`);
        await this.execute(plan.rollbackPlan, true);
      }
    }, minutes * 60 * 1000);
  }
  
  private dryRunExecution(plan: RecoveryPlan): RecoveryExecution {
    return {
      planId: plan.id,
      status: 'SUCCESS',
      startTime: new Date().toISOString(),
      endTime: new Date().toISOString(),
      appliedActions: plan.actions.map(a => `[DRY RUN] ${a.type}`)
    };
  }
  
  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
  
  getExecution(planId: string): RecoveryExecution | undefined {
    return this.executions.get(planId);
  }
}

export const autoRecovery = new AutoRecoveryEngine();

// Listen for AI events
aiEvents.on('*', async (event: AIEvent) => {
  const plan = await autoRecovery.generateRecoveryPlan(event);
  
  if (plan) {
    if (plan.requiresApproval) {
      // Send to Slack for approval (implemented in Step 3)
      const { requestApproval } = await import('./recovery-approval');
      await requestApproval(plan);
    } else {
      // Execute automatically
      await autoRecovery.execute(plan);
    }
  }
});

export default autoRecovery;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369

Step 3: Slack Approval Workflow (10 minutes)

Create services/recovery-approval.ts:

typescript

import { App } from '@slack/bolt';
import { RecoveryPlan } from '../config/recovery-runbooks';
import { autoRecovery } from './auto-recovery';

const slack = new App({
  token: process.env.SLACK_BOT_TOKEN,
  signingSecret: process.env.SLACK_SIGNING_SECRET
});

const RECOVERY_CHANNEL = process.env.SLACK_RECOVERY_CHANNEL || '#auto-recovery';

const pendingApprovals = new Map<string, {
  plan: RecoveryPlan;
  timeout: NodeJS.Timeout;
}>();

export async function requestApproval(plan: RecoveryPlan): Promise<void> {
  if (!process.env.SLACK_BOT_TOKEN) {
    console.log('[Recovery Approval] Slack not configured, auto-rejecting');
    return;
  }
  
  try {
    const message = await slack.client.chat.postMessage({
      channel: RECOVERY_CHANNEL,
      blocks: [
        {
          type: 'header',
          text: {
            type: 'plain_text',
            text: `🤖 Auto-Recovery Approval Required`,
            emoji: true
          }
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*Event:* ${plan.event.type}\n*Runbook:* ${plan.runbook.name}\n*Confidence:* ${Math.round(plan.confidence * 100)}%\n*Est. Recovery Time:* ${plan.estimatedRecoveryTime} minutes`
          }
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*Problem:*\n${plan.event.summary}`
          }
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*Planned Actions:*\n${formatActions(plan.actions)}`
          }
        },
        {
          type: 'section',
          text: {
            type: 'mrkdwn',
            text: `*Rollback Plan:* ${plan.rollbackPlan ? 'Available ✅' : 'None ⚠️'}`
          }
        },
        {
          type: 'actions',
          block_id: `recovery_${plan.id}`,
          elements: [
            {
              type: 'button',
              text: { type: 'plain_text', text: '✅ Approve & Execute' },
              style: 'primary',
              action_id: 'approve_recovery',
              value: plan.id
            },
            {
              type: 'button',
              text: { type: 'plain_text', text: '❌ Reject' },
              style: 'danger',
              action_id: 'reject_recovery',
              value: plan.id
            },
            {
              type: 'button',
              text: { type: 'plain_text', text: '🔍 View Details' },
              action_id: 'view_recovery_details',
              value: plan.id
            }
          ]
        },
        {
          type: 'context',
          elements: [
            {
              type: 'mrkdwn',
              text: `⏰ This request will timeout in 5 minutes`
            }
          ]
        }
      ]
    });
    
    // Set 5-minute timeout
    const timeout = setTimeout(async () => {
      console.log(`[Recovery Approval] Timeout for plan ${plan.id}`);
      pendingApprovals.delete(plan.id);
      
      await slack.client.chat.update({
        channel: RECOVERY_CHANNEL,
        ts: message.ts!,
        text: '⏰ Approval timeout - Recovery NOT executed',
        blocks: [
          {
            type: 'section',
            text: {
              type: 'mrkdwn',
              text: `⏰ *Approval Timeout*\n\nRecovery plan for ${plan.runbook.name} was not executed due to timeout.`
            }
          }
        ]
      });
    }, 5 * 60 * 1000);
    
    pendingApprovals.set(plan.id, { plan, timeout });
    
    console.log(`[Recovery Approval] Sent approval request for ${plan.runbook.name}`);
    
  } catch (error) {
    console.error('[Recovery Approval] Failed to request approval:', error);
  }
}

function formatActions(actions: any[]): string {
  return actions.map((action, i) => {
    const params = JSON.stringify(action.parameters, null, 2)
      .split('\n')
      .map(line => `  ${line}`)
      .join('\n');
    
    return `${i + 1}. *${action.type}*\n\`\`\`${params}\`\`\``;
  }).join('\n');
}

// Handle approval
slack.action('approve_recovery', async ({ ack, body, client }) => {
  await ack();
  
  const planId = (body as any).actions[0].value;
  const approval = pendingApprovals.get(planId);
  
  if (!approval) {
    await client.chat.postMessage({
      channel: RECOVERY_CHANNEL,
      thread_ts: (body as any).message.ts,
      text: '❌ Recovery plan not found or already executed'
    });
    return;
  }
  
  clearTimeout(approval.timeout);
  pendingApprovals.delete(planId);
  
  // Execute recovery
  console.log(`[Recovery Approval] Approved by ${(body as any).user.id}`);
  
  await client.chat.update({
    channel: RECOVERY_CHANNEL,
    ts: (body as any).message.ts,
    text: `✅ Approved by <@${(body as any).user.id}>`,
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `✅ *Approved by <@${(body as any).user.id}>*\n\nExecuting recovery plan: ${approval.plan.runbook.name}...`
        }
      }
    ]
  });
  
  try {
    const execution = await autoRecovery.execute(approval.plan, true);
    
    await client.chat.postMessage({
      channel: RECOVERY_CHANNEL,
      thread_ts: (body as any).message.ts,
      text: execution.status === 'SUCCESS'
        ? `✅ Recovery completed successfully!\n\nActions applied:\n${execution.appliedActions.map(a => `• ${a}`).join('\n')}`
        : `❌ Recovery failed: ${execution.error}`
    });
    
  } catch (error) {
    await client.chat.postMessage({
      channel: RECOVERY_CHANNEL,
      thread_ts: (body as any).message.ts,
      text: `❌ Recovery execution failed: ${error.message}`
    });
  }
});

// Handle rejection
slack.action('reject_recovery', async ({ ack, body, client }) => {
  await ack();
  
  const planId = (body as any).actions[0].value;
  const approval = pendingApprovals.get(planId);
  
  if (approval) {
    clearTimeout(approval.timeout);
    pendingApprovals.delete(planId);
  }
  
  await client.chat.update({
    channel: RECOVERY_CHANNEL,
    ts: (body as any).message.ts,
    text: `❌ Rejected by <@${(body as any).user.id}>`,
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: `❌ *Rejected by <@${(body as any).user.id}>*\n\nRecovery will NOT be executed.`
        }
      }
    ]
  });
});

// View details
slack.action('view_recovery_details', async ({ ack, body, client }) => {
  await ack();
  
  const planId = (body as any).actions[0].value;
  const approval = pendingApprovals.get(planId);
  
  if (!approval) {
    return;
  }
  
  await client.chat.postMessage({
    channel: RECOVERY_CHANNEL,
    thread_ts: (body as any).message.ts,
    text: `📋 *Recovery Plan Details*\n\n\`\`\`json\n${JSON.stringify(approval.plan, null, 2)}\n\`\`\``
  });
});

// Start Slack app
(async () => {
  if (process.env.SLACK_BOT_TOKEN) {
    await slack.start(process.env.SLACK_PORT || 3001);
    console.log('[Recovery Approval] Slack app started');
  }
})();

export default slack;

Step 4: Recovery Monitoring (5 minutes)

Create services/recovery-monitor.ts:

typescript

import aiEvents from '../config/ai-events';
import { AIEventType, EventSeverity } from '@flexgate/ai';

// Monitor metrics after recovery
export async function monitorRecovery(planId: string, metrics: string[]): Promise<boolean> {
  const startTime = Date.now();
  const maxDuration = 5 * 60 * 1000; // 5 minutes
  
  console.log(`[Recovery Monitor] Monitoring ${metrics.join(', ')} for plan ${planId}`);
  
  while (Date.now() - startTime < maxDuration) {
    const healthy = await checkMetrics(metrics);
    
    if (healthy) {
      console.log('[Recovery Monitor] Metrics stable ✅');
      
      // Emit recovery signal
      await aiEvents.emit({
        type: AIEventType.RECOVERY_SIGNAL,
        severity: EventSeverity.INFO,
        summary: `Recovery successful for plan ${planId}`,
        data: {
          metric: 'recovery_status',
          current_value: 100,
          threshold: 80,
          window: '5m',
          trend: 'STABLE',
          unit: '%'
        },
        context: {
          plan_id: planId,
          monitored_metrics: metrics
        }
      });
      
      return true;
    }
    
    await sleep(10000); // Check every 10 seconds
  }
  
  console.log('[Recovery Monitor] Metrics did not stabilize ❌');
  return false;
}

async function checkMetrics(metrics: string[]): Promise<boolean> {
  // Implementation depends on your metrics system
  // Example checks:
  // - Error rate < 5%
  // - Latency p99 < threshold
  // - CPU/Memory normal
  
  return true; // Placeholder
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

export default { monitorRecovery };

Testing

Create tests/auto-recovery.test.ts:

typescript

import { AIEventFactory, AIEventType, EventSeverity } from '@flexgate/ai';
import { autoRecovery } from '../services/auto-recovery';
import { findRunbook } from '../config/recovery-runbooks';

describe('Auto-Recovery', () => {
  
  it('should find matching runbook', () => {
    const event = AIEventFactory.create({
      type: AIEventType.LATENCY_ANOMALY,
      summary: 'Database connection pool exhausted',
      severity: EventSeverity.WARNING,
      data: {
        metric: 'latency_ms',
        current_value: 2500,
        threshold: 100,
        window: '5m',
        trend: 'RISING',
        unit: 'ms'
      },
      context: {
        cause: 'DATABASE'
      }
    });
    
    const runbook = findRunbook(event);
    
    expect(runbook).toBeDefined();
    expect(runbook?.name).toBe('scale-database-connections');
  });
  
  it('should generate recovery plan', async () => {
    const event = AIEventFactory.createSample('CIRCUIT_BREAKER_CANDIDATE');
    event.confidence = 0.90;
    
    const plan = await autoRecovery.generateRecoveryPlan(event);
    
    expect(plan).toBeDefined();
    expect(plan?.runbook.name).toBe('open-circuit-breaker');
    expect(plan?.requiresApproval).toBe(false);
    expect(plan?.rollbackPlan).toBeDefined();
  });
  
  it('should execute dry run', async () => {
    const event = AIEventFactory.createSample('CAPACITY_WARNING');
    const plan = await autoRecovery.generateRecoveryPlan(event);
    
    if (plan) {
      plan.runbook.safety.dryRun = true;
      const execution = await autoRecovery.execute(plan);
      
      expect(execution.status).toBe('SUCCESS');
      expect(execution.appliedActions.length).toBeGreaterThan(0);
      expect(execution.appliedActions[0]).toContain('[DRY RUN]');
    }
  });
  
});

Verification Checklist

✅ Configuration

[ ] Recovery runbooks defined
[ ] Safety guardrails configured
[ ] Slack bot configured
[ ] Dry-run mode tested

✅ Auto-Recovery

[ ] Plans generating correctly
[ ] Rate limiting working
[ ] Rollback plans created

✅ Approval Workflow

[ ] Slack approvals arriving
[ ] Timeout working (5 min)
[ ] Rejection handling

✅ Execution

[ ] Dry-run mode working
[ ] Actions executing
[ ] Rollback on failure

Production Deployment

Phase 1: Dry-Run (Week 1)

typescript

// All runbooks in dry-run mode
safety: { dryRun: true }

Phase 2: Low-Risk Auto (Week 2)

typescript

// Enable auto for cache clearing, rate limiting
safety: { requiresApproval: false, dryRun: false }

Phase 3: Full Production (Week 3+)

typescript

// Enable all runbooks with appropriate approval settings

Troubleshooting

Issue: Approvals timing out

Solution: Increase timeout or reduce scope

Issue: Actions failing

Solution: Check permissions, test dry-run

Issue: Rollback not working

Solution: Verify rollback plan generation

Success Metrics

Auto-heal rate: Target 80%+
MTTR: Target <10 minutes
False actions: Target <5%

Playbook Version: 1.0.0
Last Updated: February 15, 2026
Tested With: FlexGate v2.0.0, Kubernetes 1.28

Autonomous Service Recovery Playbook ​

Overview ​

⚠️ Safety First ​

Step 1: Define Recovery Runbooks (10 minutes) ​

Step 2: Implement Auto-Recovery Engine (10 minutes) ​

Step 3: Slack Approval Workflow (10 minutes) ​

Step 4: Recovery Monitoring (5 minutes) ​

Testing ​

Verification Checklist ​

✅ Configuration ​

✅ Auto-Recovery ​

✅ Approval Workflow ​

✅ Execution ​

Production Deployment ​

Phase 1: Dry-Run (Week 1) ​

Phase 2: Low-Risk Auto (Week 2) ​

Phase 3: Full Production (Week 3+) ​

Troubleshooting ​

Issue: Approvals timing out ​

Issue: Actions failing ​

Issue: Rollback not working ​

Success Metrics ​

Autonomous Service Recovery Playbook

Overview

⚠️ Safety First

Step 1: Define Recovery Runbooks (10 minutes)

Step 2: Implement Auto-Recovery Engine (10 minutes)

Step 3: Slack Approval Workflow (10 minutes)

Step 4: Recovery Monitoring (5 minutes)

Testing

Verification Checklist

✅ Configuration

✅ Auto-Recovery

✅ Approval Workflow

✅ Execution

Production Deployment

Phase 1: Dry-Run (Week 1)

Phase 2: Low-Risk Auto (Week 2)

Phase 3: Full Production (Week 3+)

Troubleshooting

Issue: Approvals timing out

Issue: Actions failing

Issue: Rollback not working

Success Metrics