Investigate Incident End-To-End

Run a complete incident workflow from signal intake to evidence-based closure.

Task Outcome

Incident handling remains controlled, reproducible, and auditable under time pressure.

When To Use This Guide

  • Active production incidents.
  • Repeated instability where root cause is unclear.
  • Post-incident reviews that require evidence gap closure.

Prerequisites

  • On-call owner is assigned.
  • Access to diagnostics, approvals, and rollback modules is confirmed.
  • Notification and audit channels are functioning.

UI Route Map

  1. Start from alert or incident signal.
  2. Open runtime diagnostics and recent delivery timeline.
  3. Correlate request/approval/execution history.
  4. Contain and remediate through approved workflow.
  5. Close incident with complete evidence package.

End-To-End Incident Sequence

  1. Capture initial symptom, first seen time, and impacted scope.
  2. Correlate runtime signals with recent release/promotion/rollback actions.
  3. Select containment method: freeze, targeted mitigation, or rollback request.
  4. Execute remediation through approval-governed path.
  5. Verify service restoration and monitor stability window.
  6. Complete closure checklist with evidence and preventive follow-up.
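
The correlation step in the sequence above can be sketched in a few lines. This is a minimal illustration, not GitOpsHQ's implementation: the `DeliveryAction` shape, the `correlate` function, and the two-hour lookback are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Set


@dataclass(frozen=True)
class DeliveryAction:
    """One delivery timeline entry (hypothetical shape, not a GitOpsHQ type)."""
    kind: str            # "release", "promotion", or "rollback"
    service: str
    executed_at: datetime


def correlate(first_seen: datetime,
              impacted_services: Set[str],
              actions: List[DeliveryAction],
              lookback: timedelta = timedelta(hours=2)) -> List[DeliveryAction]:
    """Return actions on impacted services executed within `lookback`
    before `first_seen`, most recent first."""
    window_start = first_seen - lookback
    candidates = [
        a for a in actions
        if a.service in impacted_services
        and window_start <= a.executed_at <= first_seen
    ]
    return sorted(candidates, key=lambda a: a.executed_at, reverse=True)


if __name__ == "__main__":
    first_seen = datetime(2024, 1, 10, 14, 30)
    timeline = [
        DeliveryAction("release", "checkout", datetime(2024, 1, 10, 14, 5)),
        DeliveryAction("promotion", "checkout", datetime(2024, 1, 10, 9, 0)),
        DeliveryAction("rollback", "search", datetime(2024, 1, 10, 14, 20)),
    ]
    for action in correlate(first_seen, {"checkout"}, timeline):
        print(action.kind, action.service, action.executed_at.isoformat())
```

A short candidate list is a starting point for analysis, not proof of cause; the decision notes should still record why a given change was judged responsible.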

Containment Decision Guide

Freeze

  • Use when further deployments could amplify impact.
  • Best for immediate blast-radius containment.

Rollback request

  • Use when a recent change is strongly correlated with the failure.
  • Pick the rollback mode with the smallest effective scope.

Targeted mitigation

  • Use when the issue is isolated and quickly reversible without a broad rollback.
  • Must still preserve the audit trace and decision notes.

Temporary override

  • Use only when incident urgency requires a temporary override.
  • Close the override immediately after the emergency action.
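
As a rough aid, the "use when" conditions for freeze, rollback request, and targeted mitigation can be written as a small helper that lists which methods apply. The names below (`Containment`, `applicable_containment`, and the three boolean inputs) are hypothetical. The helper deliberately returns every applicable method rather than ranking them, because the guide does not define a precedence.

```python
from enum import Enum
from typing import List


class Containment(Enum):
    FREEZE = "freeze"
    ROLLBACK_REQUEST = "rollback request"
    TARGETED_MITIGATION = "targeted mitigation"


def applicable_containment(*,
                           deployments_could_amplify: bool,
                           change_strongly_correlated: bool,
                           isolated_and_quickly_reversible: bool
                           ) -> List[Containment]:
    """Map the guide's conditions to containment methods (illustrative only)."""
    options: List[Containment] = []
    if deployments_could_amplify:
        options.append(Containment.FREEZE)
    if change_strongly_correlated:
        options.append(Containment.ROLLBACK_REQUEST)
    if isolated_and_quickly_reversible:
        options.append(Containment.TARGETED_MITIGATION)
    return options


if __name__ == "__main__":
    picks = applicable_containment(
        deployments_could_amplify=True,
        change_strongly_correlated=True,
        isolated_and_quickly_reversible=False,
    )
    print([p.value for p in picks])
```

An empty result means none of the listed conditions hold yet, which usually points back to further correlation work before choosing a method.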

Evidence Checklist

  • Timeline: request -> approval -> execution -> runtime impact.
  • Scope: tenant/environment/service details.
  • Decision notes: why this remediation path was selected.
  • Recovery proof: post-fix stability and drift status.
  • Preventive action: owner and target date.
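
One way to keep closure honest under time pressure is to check the evidence package mechanically before the incident is closed. The sketch below assumes the package is a plain dictionary; the key names and the `missing_evidence` function are illustrative and do not come from GitOpsHQ.

```python
from typing import Dict, List

# Keys mirror the checklist above; the names themselves are assumptions.
REQUIRED_EVIDENCE = (
    "timeline",                       # request -> approval -> execution -> runtime impact
    "scope",                          # tenant/environment/service details
    "decision_notes",                 # why this remediation path was selected
    "recovery_proof",                 # post-fix stability and drift status
    "preventive_action_owner",
    "preventive_action_target_date",
)


def missing_evidence(package: Dict[str, str]) -> List[str]:
    """Return checklist keys that are absent or blank, in checklist order."""
    return [
        key for key in REQUIRED_EVIDENCE
        if not package.get(key, "").strip()
    ]


if __name__ == "__main__":
    package = {
        "timeline": "request, approval, and execution recorded; errors began after execution",
        "scope": "tenant acme / production / checkout",
        "decision_notes": "",
        "recovery_proof": "error rate back to baseline, no drift reported",
        "preventive_action_owner": "on-call owner",
    }
    gaps = missing_evidence(package)
    if gaps:
        print("Cannot close, missing:", ", ".join(gaps))
    else:
        print("Evidence package complete")
```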

Common Failure Modes

  • Jumping to rollback without correlation analysis.
  • Skipping approval trail updates under pressure.
  • Closing the incident before the stability window has elapsed.
  • Missing preventive action ownership.

Incident Report Template

Incident title:
First seen:
Impacted scope:

Timeline summary:
Containment action:
Remediation action:

Requester / approver / executor:
Recovery verification:
Preventive action + owner:
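
If reports are produced from tooling rather than typed by hand, the template can be filled from structured data. The following is a sketch under that assumption: the `IncidentReport` class and its field names are hypothetical, and only the rendered labels follow the template above.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    """Structured form of the report template (illustrative field names)."""
    title: str
    first_seen: str
    impacted_scope: str
    timeline_summary: str
    containment_action: str
    remediation_action: str
    requester: str
    approver: str
    executor: str
    recovery_verification: str
    preventive_action: str
    preventive_owner: str

    def render(self) -> str:
        """Render the report as plain text, one blank line between groups."""
        groups = [
            [("Incident title", self.title),
             ("First seen", self.first_seen),
             ("Impacted scope", self.impacted_scope)],
            [("Timeline summary", self.timeline_summary),
             ("Containment action", self.containment_action),
             ("Remediation action", self.remediation_action)],
            [("Requester / approver / executor",
              f"{self.requester} / {self.approver} / {self.executor}"),
             ("Recovery verification", self.recovery_verification),
             ("Preventive action + owner",
              f"{self.preventive_action} ({self.preventive_owner})")],
        ]
        return "\n\n".join(
            "\n".join(f"{label}: {value}" for label, value in group)
            for group in groups
        )


if __name__ == "__main__":
    report = IncidentReport(
        title="Checkout error spike",
        first_seen="2024-01-10 14:30 UTC",
        impacted_scope="production / checkout",
        timeline_summary="release executed 14:05, errors from 14:30",
        containment_action="freeze",
        remediation_action="rollback request for checkout",
        requester="alice", approver="bob", executor="carol",
        recovery_verification="stable for the full observation window",
        preventive_action="add pre-promotion smoke test",
        preventive_owner="dana",
    )
    print(report.render())
```

Keeping requester, approver, and executor as separate fields makes it harder to leave one of the three roles unrecorded.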

Continue With

  1. Execute Rollback Safely
  2. Configure Notifications And Audit Trace
  3. SRE Playbook
