How-To Guides
Investigate Incident End-To-End
Run a complete incident workflow from signal intake to evidence-based closure.
Task Outcome
Incident handling remains controlled, reproducible, and auditable under time pressure.
When To Use This Guide
- Active production incidents.
- Repeated instability where root cause is unclear.
- Post-incident reviews that need to close evidence gaps.
Prerequisites
- On-call owner is assigned.
- Access to diagnostics, approvals, and rollback modules is confirmed.
- Notification and audit channels are functioning.
UI Route Map
- Start from alert or incident signal.
- Open runtime diagnostics and recent delivery timeline.
- Correlate request/approval/execution history.
- Contain and remediate through approved workflow.
- Close incident with complete evidence package.
End-To-End Incident Sequence
1. Capture initial symptom, first seen time, and impacted scope.
2. Correlate runtime signals with recent release/promotion/rollback actions.
3. Select containment method: freeze, targeted mitigation, or rollback request.
4. Execute remediation through approval-governed path.
5. Verify service restoration and monitor stability window.
6. Complete closure checklist with evidence and preventive follow-up.
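The correlation step in the sequence above can be sketched in a few lines. This is only an illustration: the `DeliveryAction` shape, the `correlate` helper, and the two-hour lookback are assumptions made for the example, not part of any product API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DeliveryAction:
    """One delivery timeline entry (hypothetical shape)."""
    kind: str              # "release", "promotion", or "rollback"
    service: str
    executed_at: datetime


def correlate(first_seen, impacted_services, actions, lookback=timedelta(hours=2)):
    """Return actions on impacted services executed within `lookback`
    before `first_seen`, most recent first.

    The lookback default is an assumption; tune it to your release cadence.
    """
    window_start = first_seen - lookback
    candidates = [
        a for a in actions
        if a.service in impacted_services
        and window_start <= a.executed_at <= first_seen
    ]
    return sorted(candidates, key=lambda a: a.executed_at, reverse=True)


# Example: one recent release on the impacted service is a correlation candidate.
first_seen = datetime(2024, 1, 10, 14, 30)
timeline = [
    DeliveryAction("release", "checkout", datetime(2024, 1, 10, 14, 5)),
    DeliveryAction("promotion", "search", datetime(2024, 1, 10, 14, 20)),
    DeliveryAction("release", "checkout", datetime(2024, 1, 9, 9, 0)),
]
suspects = correlate(first_seen, {"checkout"}, timeline)
```

A result like this is a list of candidates to investigate, not proof of cause; the containment choice should still rest on the runtime evidence.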
Containment Decision Guide
Freeze
- Use when further deployments could amplify impact.
- Best for immediate blast-radius containment.
Rollback request
- Use when a recent change is strongly correlated with failure.
- Pick the rollback mode with the smallest effective scope.
Targeted mitigation
- Use when the issue is isolated and quickly reversible without a broad rollback.
- Must still preserve the audit trace and decision notes.
Emergency override
- Use only when incident urgency requires a temporary override.
- Close the override immediately after the emergency action.
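The decision criteria above can be expressed as a small lookup, which may help when encoding the guide in tooling. This is a minimal sketch: the `Option` names are inferred from the containment methods named in the incident sequence, and `Signals` and `applicable_options` are hypothetical helpers, not an existing API.

```python
from dataclasses import dataclass
from enum import Enum


class Option(Enum):
    # Labels are assumptions inferred from the guide's wording.
    FREEZE = "freeze"
    ROLLBACK_REQUEST = "rollback request"
    TARGETED_MITIGATION = "targeted mitigation"
    EMERGENCY_OVERRIDE = "emergency override"


@dataclass(frozen=True)
class Signals:
    """Observations the on-call owner has confirmed (hypothetical shape)."""
    deployments_could_amplify: bool = False
    recent_change_correlated: bool = False
    isolated_and_reversible: bool = False
    urgency_requires_override: bool = False


def applicable_options(signals: Signals) -> list:
    """List every containment option whose 'use when' condition holds.

    The function does not rank options; the final choice stays with
    the on-call owner and must be recorded in the decision notes.
    """
    rules = [
        (signals.deployments_could_amplify, Option.FREEZE),
        (signals.recent_change_correlated, Option.ROLLBACK_REQUEST),
        (signals.isolated_and_reversible, Option.TARGETED_MITIGATION),
        (signals.urgency_requires_override, Option.EMERGENCY_OVERRIDE),
    ]
    return [option for condition, option in rules if condition]
```

Returning all matching options rather than a single answer keeps the human decision explicit, which fits the requirement to record why a remediation path was selected.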
Evidence Checklist
- Timeline: request -> approval -> execution -> runtime impact.
- Scope: tenant/environment/service details.
- Decision notes: why this remediation path was selected.
- Recovery proof: post-fix stability and drift status.
- Preventive action: owner and target date.
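A closure gate can check the checklist above mechanically before an incident is marked closed. The sketch below assumes a hypothetical `EvidencePackage` record with free-text fields; a real system would likely link to audit entries rather than store strings.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class EvidencePackage:
    """Closure evidence for one incident (hypothetical shape)."""
    timeline: Optional[str] = None          # request -> approval -> execution -> runtime impact
    scope: Optional[str] = None             # tenant/environment/service details
    decision_notes: Optional[str] = None    # why this remediation path was selected
    recovery_proof: Optional[str] = None    # post-fix stability and drift status
    preventive_owner: Optional[str] = None
    preventive_target_date: Optional[date] = None


def missing_evidence(pkg: EvidencePackage) -> list:
    """Return the names of checklist items that are empty or unset."""
    checks = {
        "timeline": pkg.timeline,
        "scope": pkg.scope,
        "decision notes": pkg.decision_notes,
        "recovery proof": pkg.recovery_proof,
        "preventive action owner": pkg.preventive_owner,
        "preventive action target date": pkg.preventive_target_date,
    }
    return [name for name, value in checks.items() if not value]
```

An empty result means every checklist item has some content; it does not judge whether that content is adequate, which remains a review task.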
Common Failure Modes
- Jumping to rollback without correlation analysis.
- Skipping approval trail updates under pressure.
- Closing the incident before the stability observation window has elapsed.
- Missing preventive action ownership.
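Premature closure is one failure mode that is easy to guard against in tooling. The following sketch assumes a hypothetical `observation_window_elapsed` helper and a 30-minute default window; both are illustrative choices, and the right window length depends on the service.

```python
from datetime import datetime, timedelta
from typing import Optional


def observation_window_elapsed(
    restored_at: datetime,
    now: datetime,
    window: timedelta = timedelta(minutes=30),   # assumed default, not a documented value
    last_regression_at: Optional[datetime] = None,
) -> bool:
    """Return True once the service has been stable for a full window.

    A regression observed after restoration restarts the clock.
    """
    stable_since = restored_at
    if last_regression_at is not None and last_regression_at > restored_at:
        stable_since = last_regression_at
    return now - stable_since >= window
```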
Incident Report Template
Incident title:
First seen:
Impacted scope:
Timeline summary:
Containment action:
Remediation action:
Requester / approver / executor:
Recovery verification:
Preventive action + owner:
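If reports are generated by tooling, the template above maps naturally onto a small record type. This is a sketch under assumptions: `IncidentReport` and its field names are hypothetical, and only the printed labels come from the template.

```python
from dataclasses import dataclass


@dataclass
class IncidentReport:
    """Fields mirror the report template; names are illustrative."""
    title: str
    first_seen: str
    impacted_scope: str
    timeline_summary: str
    containment_action: str
    remediation_action: str
    requester: str
    approver: str
    executor: str
    recovery_verification: str
    preventive_action: str
    preventive_owner: str

    def render(self) -> str:
        """Render the report using the template's labels, one per line."""
        lines = [
            f"Incident title: {self.title}",
            f"First seen: {self.first_seen}",
            f"Impacted scope: {self.impacted_scope}",
            f"Timeline summary: {self.timeline_summary}",
            f"Containment action: {self.containment_action}",
            f"Remediation action: {self.remediation_action}",
            f"Requester / approver / executor: "
            f"{self.requester} / {self.approver} / {self.executor}",
            f"Recovery verification: {self.recovery_verification}",
            f"Preventive action + owner: "
            f"{self.preventive_action} ({self.preventive_owner})",
        ]
        return "\n".join(lines)
```

Making every field required means an incomplete report fails at construction time instead of being filed with blanks.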